## Supervised Learning Series

# Part 2: Logistic regression

In the previous post we discussed the fitting of a linear model to a set of input/output points - otherwise known as *linear regression*.  In general all sorts of nonlinear phenomena present themselves, and the data they generate - whose input and output share a nonlinear relationship - are poorly modeled using a linear model, thus causing linear regression to perform rather poorly.  This naturally leads to the exploration of fitting *nonlinear* functions to data, referred to in general as *nonlinear regression*.

In this post we describe a particular form of nonlinear regression called *logistic regression* that is designed to deal with a kind of dataset commonly encountered in machine learning / deep learning: *two-class classification data*.  This sort of data is distinguished by the fact that its output values are constrained to be one of two fixed values.  As we will see, such a constraint naturally leads to the choice of a *logistic sigmoid function* as the ideal nonlinear function to fit to such data, hence the name *logistic regression*.

In [1]:
# imports from custom library
import sys
sys.path.append('../../')
import matplotlib.pyplot as plt
from mlrefined_libraries import superlearn_library as superlearn
from mlrefined_libraries import math_optimization_library as optlib
import autograd.numpy as np
import math
import pandas as pd
%matplotlib notebook

# this is needed to compensate for %matplotlib notebook's tendency to blow up images when plotted inline
from matplotlib import rcParams
rcParams['figure.autolayout'] = True

%load_ext autoreload
%autoreload 2

# 1.  Setting the scene

In this Section we set the scene for logistic regression by describing the problem setup and how linear regression - as well as reasonable extensions of it - naturally fails with such data.

## 1.1 The data

Two-class classification is a particular instance of *regression* or *surface-fitting*, wherein the output of a dataset of $P$ points $\left\{ \left(\mathbf{x}_{p},y_{p}\right)\right\} _{p=1}^{P}$ is no longer continuous but takes on two fixed numbers.  The actual value of these numbers is in principle arbitrary, but particular value pairs are more helpful than others for derivation purposes (i.e., it is easier to determine a proper nonlinear function to regress on the data for particular output value pairs).  We will typically use $y_{p}\in\left\{ -1,\,+1\right\}$ - that is, every output takes on either the value $+1$ or $-1$.  Often in the context of classification the output values $y_p$ are called *labels*, and all points sharing the same label value are referred to as a *class* of data.  Hence a dataset containing points with label values $y_{p}\in\left\{ -1,\,+1\right\}$ is said to be a dataset consisting of two classes.

The simplest shape such a dataset can take is that of a set of linearly separated adjacent 'steps', as illustrated in the Figure below.  Here the 'bottom' step is the region of the space containing most of the points that have label value $y_p = -1$.  The 'top' step likewise contains most of the points having label value $y_p = +1$.  With $N$ denoting the number of inputs, these steps are largely separated by a point when $N = 1$, a line when $N = 2$, and a hyperplane when $N$ is larger (the term 'hyperplane' is also used more generally to refer to a point or line as well).  

As shown in the figure, because its output takes on a discrete set of values one can view a classification dataset 'from above'.  In this 'separator' perspective we remove the vertical $y$ dimension of the data and visually represent the dataset using its input only, displaying the output values of each point by coloring its input one of two unique colors (we choose blue for points with label $y_p = -1$, and red for those having label $y_p = +1$).  This separator view allows us to more easily identify the linear boundary that (mostly) separates the two classes.

<figure>
  <img src= '../../mlrefined_images/superlearn_images/Fig_4_10.png' width="60%" height="60%" alt=""/>
  <figcaption>   
<strong>Figure 1:</strong> <em> Classification from a regression/surface-fitting perspective for single-input (left panels) and two-input (right panels) toy datasets. This surface-fitting view is equivalent to the 'separator' perspective where we look at each respective dataset 'from above'.  In this perspective we can more easily identify the separating hyperplane, i.e., where the step function (shown here in yellow) transitions from its bottom to top step.  In the separator view the vertical axis (the actual $y$ value) is removed, with all points being represented by their input values only.  The label of each point is then represented by color: the points in blue denote label value $+1$ while those in red have label $-1$. </em>  </figcaption> 
</figure>

This is the simplest sort of dataset with binary output we could aim to perform regression on, as in general the boundary between the two classes of data could certainly be nonlinear.  We will deal with this more general potentiality later on - when discussing neural networks, trees, and kernel-based methods - but first let us deal with the current scenario.  How can we perform regression on a dataset like the ones described in the Figure above?

## 1.2  Trying to fit a discontinuous step function

How can we fit a regression to data that is largely distributed on two adjacent steps separated by a hyperplane?  Let's look at a simple instance of such a dataset with a single input ($N = 1$) to build our intuition about what must be done in general.

Intuitively it is obvious that simply fitting a line of the form $y = w_0 + w_1x_{\,}$ to such a dataset will give extremely subpar results - the line by itself is simply too inflexible to account for the nonlinearity present in the data.  A dataset that is roughly distributed on two steps needs to be fit with a function that matches this general shape.  In other words such data needs to be fit with a *step function*.

Since the boundary between the two classes is (assumed to be) linear and the labels take on the values $\pm 1$, ideally we would like to fit a *discontinuous step function with a linear boundary* to such a dataset.  What does such a function look like?  When $N = 1$ we simply take a line $y = w_0 + w_1x_{\,}$ and compose it with the $\text{sign}(\cdot)$ function, giving

\begin{equation}
\text{sign}\left(w_0 + w_1x_{\,} \right)
\end{equation}

Remember that the $\text{sign}(\cdot)$ function is defined as $\text{sign}(\alpha) = 
\begin{cases}
+1 \,\,\,\,\,\text{if} \,\, \alpha > 0 \\
-1 \,\,\,\,\,\text{if} \,\, \alpha < 0 \\
\end{cases}$.  Note here that what happens with $\text{sign}(0)$ is - for our purposes - arbitrary (i.e., it can be set to any fixed value or left undefined as we have done).

How do we tune the parameters of the line?  We could try to take the lazy way out and *first* fit the line to the classification dataset via linear regression, then compose the line with the sign function to get a step function fit.  However this does not work well in general - as we will see even in the simple instance below.  Instead we need to tune the parameters $w_0$ and $w_1$ *after* composing the line with the sign function, or in other words we need to tune the parameters of $\text{sign}\left(w_0 + w_1x_{\,} \right)$.
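
The 'lazy' route described above can be sketched in a few lines.  This is only an illustration under stated assumptions: the five-point dataset is a hypothetical toy example (not the CSV file used in this post), and plain NumPy's `np.polyfit` stands in for the linear regression step.

```python
import numpy as np

# Hypothetical toy dataset (an assumption, not the post's data): one input, labels in {-1, +1}.
x = np.array([0.0, 1.0, 2.0, 3.0, 12.0])
y = np.array([-1.0, -1.0, 1.0, 1.0, 1.0])

# Step 1: ordinary least squares fit of a line to the labels.
# np.polyfit returns the highest-degree coefficient first, i.e. (slope, intercept).
w1, w0 = np.polyfit(x, y, deg=1)

# Step 2: compose the fitted line with sign to get a step function.
naive_predictions = np.sign(w0 + w1 * x)
naive_errors = int(np.sum(naive_predictions != y))

# For comparison: a hand-picked step whose boundary sits between x = 1 and x = 2.
ideal_predictions = np.sign(-1.5 + 1.0 * x)
ideal_errors = int(np.sum(ideal_predictions != y))

print("line-then-sign misclassifies", naive_errors, "point(s)")
print("hand-picked step misclassifies", ideal_errors, "point(s)")
```

The far-away point at $x = 12$ drags the fitted line so that its zero crossing moves past $x = 2$, which is why tuning the weights before applying the sign function can miss a perfectly good step.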

#### <span style="color:#a50e3e;">Example 1: </span> Fitting a line and taking the sign fails to represent a two-class dataset well

In the Python cell below we load in a simple two-class dataset (top panel), fit a line to this dataset via linear regression, and then compose the fitted line with the sign function to produce a step function fit.  Both the linear fit (in green) as well as its composition with the sign function (in dashed red) are shown along with the data in the bottom panel.  Of course the line itself provides a terrible representation of the nonlinear dataset.  But its evaluation through sign is also quite poor for such a simple dataset, failing to properly identify two points on the top step.  In the parlance of classification these types of points are referred to as *misclassified points*.

In [2]:
# load in dataset
# datapath = '../../mlrefined_datasets/superlearn_datasets/2d_logistic_data_v1.csv'
# data = np.asarray(pd.read_csv(datapath,header = None))

# load in dataset
data = np.loadtxt('2d_classification_data_v1.csv')

# create instance of linear regression demo, used below and in the next examples
demo1 = superlearn.logistic_regression_simple_demos.visualizer(data)
demo1.run_algo(algo = 'newtons_method',w_init = [-1,-1], max_its = 1)

# plot dataset
demo1.naive_fitting_demo()

----

How do we tune these parameters properly?  As with linear regression, here we can try to set up a proper Least Squares function that - when minimized - recovers our ideal weights. We can do this by simply reflecting on the sort of ideal relationship we want to find between the input and output of our dataset. 

Take a single point $\left(x_p,\,y_p \right)$.  Notice in the example above - and this is true more generally speaking - that ideally for a good fit we would like our weights to be such that if this point has label $+1$ it lies in the positive region of the space where $w_0 + w_1x > 0$, so that $\text{sign}\left(w_0 + w_1x_p\right) = +1$ matches its label value.  Likewise if this point has label $-1$ we would like it to lie in the negative region where $w_0 + w_1x < 0$, so that $\text{sign}\left(w_0 + w_1x_p\right) = -1$ matches its label value.  So in short what we would ideally like for this point is that its evaluation matches its label value, i.e., that

\begin{equation}
\text{sign}\left(w_0 + w_1x_p \right) = y_p 
\end{equation}

And of course we would like this to hold for every point.  To find weights that satisfy this set of $P$ equalities as best as possible we could - as we did previously with linear regression - square the difference between both sides of each and sum them up, giving the Least Squares function

\begin{equation}
g(\mathbf{w}) = \sum_{p=1}^P \left(\text{sign}\left(w_0 + w_1x_p \right) - y_p \right)^2
\end{equation}

and we can try to minimize this in order to recover weights that satisfy our desired equalities.  If we can find a set of weights such that $g(\mathbf{w}) = 0$ then all $P$ equalities above hold true, otherwise some of them do not.  Notice: each correctly classified point contributes $0$ to this Least Squares cost while each misclassified point contributes $\left(\pm 2\right)^2 = 4$, so the cost equals four times the number of *misclassified points* for a particular choice of weights and is always integer valued.  Because it counts the number of misclassifications (up to this constant factor) this cost function is often referred to as the *counting cost* for two-class classification.

Unfortunately because the counting cost takes on only integer values it is impossible to minimize with our local optimization methods, as the function is completely flat almost everywhere, i.e., wherever its gradient is defined it is exactly zero.  Because of this neither gradient descent nor Newton's method can take a single step 'downhill' regardless of where they are initialized.  This problem is inherited from our use of the sign function, itself a discontinuous step.
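
To make the flatness concrete, here is a minimal sketch of the counting cost in plain NumPy.  The dataset and the weight values are hypothetical choices made for illustration only.

```python
import numpy as np

# Hypothetical toy dataset (an assumption, not the post's data).
x = np.array([0.0, 1.0, 2.0, 3.0, 12.0])
y = np.array([-1.0, -1.0, 1.0, 1.0, 1.0])

def counting_cost(w):
    """Least Squares cost built on sign(w0 + w1*x)."""
    return float(np.sum((np.sign(w[0] + w[1] * x) - y) ** 2))

def num_misclassified(w):
    """Number of points whose sign evaluation differs from the label."""
    return int(np.sum(np.sign(w[0] + w[1] * x) != y))

w = np.array([-2.5, 1.0])               # boundary at x = 2.5, so the point x = 2 is misclassified
nudged = w + np.array([1e-3, -1e-3])    # a tiny move in weight space

print("cost:", counting_cost(w), "misclassified:", num_misclassified(w))
print("cost after a tiny nudge:", counting_cost(nudged))
```

Each misclassified point contributes $(\pm 2)^2 = 4$ to the sum, and a small change in the weights leaves the cost unchanged unless the boundary happens to cross a data point, so a local method sees no slope to follow.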

#### <span style="color:#a50e3e;">Example 2: </span> Visualizing the counting cost on a simple dataset

In the next Python cell we plot the counting cost (left panel) for the dataset displayed in Example 1, over a wide range of values for $w_0$ and $w_1$.  The counting cost surface consists of discrete steps at many different levels, each one completely flat.  Because of this no local method can be used to minimize the counting cost.

In the middle and right panels we plot the surfaces of two related cost functions on the same dataset.  We introduce the cost function shown in the middle panel in the next subsection, and the cost in the right panel in the one that follows.  We can indeed minimize either of these using a local method to recover ideal weights.

In [9]:
data = np.loadtxt('2d_classification_data_v1.csv')
demo2 = superlearn.cost_visualizer.visualizer(data)
demo2.plot_costs(viewmax = 25,view = [20,125])

## 1.3  Introducing tanh and a logistic Least Squares

As mentioned above, we cannot directly minimize the counting cost due to our use of the sign function.  In other words, we cannot directly fit a *discontinuous* step function to our data.  In order to go further we need to replace the sign, ideally with a *continuous* function that matches it very closely everywhere.  Thankfully such a function is readily available: the hyperbolic tangent function tanh - also known as a *logistic sigmoid*.  In the Figure below we plot tanh, along with several internally weighted versions of it.  As we can see in the figure, for the correct setting of internal weights tanh can be made to look arbitrarily similar to the sign function.

<figure>
  <img src= '../../mlrefined_images/superlearn_images/Fig_4_11.png' width="80%" height="80%" alt=""/>
  <figcaption>   
<strong>Figure 2:</strong> <em> (left panel) Plot of the 'tanh' function $\text{tanh}(x)$.  (right panel) By increasing the weight $w$ in $\text{tanh}(wx)$ from $w=1$ (shown in red) to $w=2$ (shown in green) and finally to $w=10$ (shown in blue), it becomes an increasingly good approximator of a sign function. </em>  </figcaption> 
</figure>
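
The claim that $\text{tanh}(wx)$ approaches the sign function as $w$ grows can be checked numerically.  The sketch below uses plain NumPy and an arbitrary grid of inputs kept away from zero (at $x = 0$ the sign function jumps, so no continuous function can match it there); the grid and the helper name `max_gap` are illustrative assumptions.

```python
import numpy as np

# Inputs on both sides of zero, excluding a small neighborhood of the jump.
x = np.concatenate([np.linspace(-3.0, -0.5, 50), np.linspace(0.5, 3.0, 50)])

def max_gap(weight):
    """Largest absolute difference between tanh(weight * x) and sign(x) on the grid."""
    return float(np.max(np.abs(np.tanh(weight * x) - np.sign(x))))

# The same internal weights shown in the figure above.
gaps = {weight: max_gap(weight) for weight in (1, 2, 10)}
for weight, gap in gaps.items():
    print(f"w = {weight:2d}: max |tanh(w x) - sign(x)| = {gap:.5f}")
```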

Swapping out the sign function with tanh in equation (2) we aim to satisfy as many of the $P$ equations

\begin{equation}
\text{tanh}\left(w_0 + w_1x_p \right) = y_p 
\end{equation}

as possible.  Using the same logic used previously to achieve this we can form a Least Squares function for *logistic regression* (i.e., the fitting of the logistic sigmoid tanh to a dataset) as

\begin{equation}
g(\mathbf{w}) = \sum_{p=1}^P \left(\text{tanh}\left(w_0 + w_1x_p \right) - y_p \right)^2
\end{equation}

whose global minimum provides for equation (4) holding for as many points as possible.

This cost function can indeed be minimized using local methods, but it is non-convex and contains large flat regions which limits our choice of algorithms to *normalized gradient descent*.  In Example 2 we plotted the surface of this Least Squares cost in the middle panel of the printout using the toy dataset from Example 1.

#### <span style="color:#a50e3e;">Example 3: </span> Using normalized gradient descent to perform logistic regression using the Least Squares cost

In this example we show how normalized gradient descent can be used to minimize the logistic Least Squares cost function described above.  First, we write out this cost function in Python in the next cell.  This translation is fairly straightforward.

In [9]:
# define the input and output of our dataset
x = data[:,0]
x.shape = (len(x),1)
y = data[:,1]
y.shape = (len(y),1)

# tanh non-convex logistic least squares cost function
def tanh_least_squares(w):
    cost = 0
    for p in range(0,len(y)):
        x_p = x[p,:]
        y_p = y[p]
        cost +=(np.tanh(w[0] + w[1]*x_p) - y_p)**2
    return cost

With this cost function defined in Python we can now run our demonstration of normalized gradient descent below.  We will use the automatic differentiator [autograd](https://github.com/HIPS/autograd) here (as well as in future examples in this post) to compute derivatives of the cost, along with the generic Python code for gradient descent provided in the respective post in our *mathematical optimization* series.  We initialize at the point $\mathbf{w} = \begin{bmatrix} 3 \\ 3 \end{bmatrix}$, set $\alpha = 1$, and run for 25 steps.  Each step of the run is animated - along with its corresponding fit to the data.  In the left panel both the data and the fit at each step (colored green to red as the run continues) are shown, while in the right panel the contour of the cost function is shown with each step marked (again colored green to red as the run continues).  Moving the slider from left to right animates the run from start to finish.

In [19]:
# create instance of logistic regression demo and load in data
demo3 = superlearn.logistic_regression_2d_demos.visualizer(data)

# setup algo, initial point, and max number of iterations
demo3.run_algo(g = tanh_least_squares,algo = 'gradient_descent',version = 'normalized',w_init = [3,3], max_its = 25,alpha = 1)

# animate descent process
demo3.animate_it_2d(num_contours = 25,viewmax = 4.5)

Near the end of the run notice how the algorithm zig-zags back and forth along a long narrow valley while moving towards the minimum of the function (this long narrow valley towards the minimum can also be seen in the surface plot of the cost function - shown in the middle panel of Example 2).  This behavior is common for gradient descent when running down long narrow valleys, in particular when using the normalized form when the length of each step is kept the same by fixing the steplength parameter $\alpha$.
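
For readers without the custom demo library, the normalized gradient descent run can be sketched directly.  This is a minimal illustration under stated assumptions: the dataset is a hypothetical toy example, the gradient is written out by hand rather than computed with autograd, and the function names are made up for this sketch.

```python
import numpy as np

# Hypothetical toy dataset (an assumption, not the post's data).
x = np.array([-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0])
y = np.sign(x)

def tanh_cost(w):
    """Logistic Least Squares cost: sum of (tanh(w0 + w1*x_p) - y_p)^2."""
    return float(np.sum((np.tanh(w[0] + w[1] * x) - y) ** 2))

def tanh_cost_gradient(w):
    t = np.tanh(w[0] + w[1] * x)
    # chain rule, using d/ds tanh(s) = 1 - tanh(s)^2
    common = 2.0 * (t - y) * (1.0 - t ** 2)
    return np.array([np.sum(common), np.sum(common * x)])

def normalized_gradient_descent(w_init, alpha, max_its):
    w = np.array(w_init, dtype=float)
    cost_history = [tanh_cost(w)]
    for _ in range(max_its):
        grad = tanh_cost_gradient(w)
        norm = np.linalg.norm(grad)
        if norm == 0.0:          # completely flat: no direction to move in
            break
        w = w - alpha * grad / norm   # every step has length exactly alpha
        cost_history.append(tanh_cost(w))
    return w, cost_history

w_final, cost_history = normalized_gradient_descent([3.0, 3.0], alpha=1.0, max_its=25)
print("initial cost:", cost_history[0])
print("lowest cost seen:", min(cost_history))
```

Because each step has the same length $\alpha$, the iterates cannot settle down once they get within a distance $\alpha$ of a minimum, which is one way to see where the zig-zagging comes from.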

Nonetheless while we did not run it until convergence here we can indeed reach the minimum of the cost function using gradient descent if we run long enough (possibly adjusting the steplength parameter $\alpha$ along the way). We show this in the next Python cell where we run the same algorithm from the initial point $\mathbf{w} = \begin{bmatrix} 20 \\ -20 \end{bmatrix}$ even farther away from the global minimum for 900 steps.  However the zig-zagging effect seen here will mean that we need to take a great many steps to reach a minimum. 

In [37]:
# create instance of logistic regression demo and load in data
demo3 = superlearn.logistic_regression_2d_demos.visualizer(data)

# setup algo, initial point, and max number of iterations
demo3.run_algo(g = tanh_least_squares,algo = 'gradient_descent',version = 'normalized',w_init = [20,-20], max_its = 900,alpha = 1)

# animate descent process
demo3.static_it_2d(num_contours = 25,viewmax = 31)

---

The optimization phenomena we saw in the previous example - zig-zagging and long narrow valleys leading to global minima - are a common occurrence with *neural network* cost functions (our logistic Least Squares cost function being a simple example of one).  As we will see later there are a number of practical 'tricks' we can use to try to ameliorate zig-zagging behavior in general (e.g., adding a *momentum* term as well as *regularization*). 

For the relatively simple case of logistic regression we can more easily ameliorate this problem by going back to our desired equalities shown in equation (4) and re-formulating the logistic regression problem from scratch.  This leads to a different (and, as we will see, always convex) cost function for logistic regression that is more commonly used in practice.

## 1.4  The convex logistic regression cost

There is more than one way to form a cost function whose minimum forces as many of the $P$ equalities in equation (4) to hold as possible.  This is because tanh is an *odd* function - that is $\text{tanh}(-wx) = -\text{tanh}(wx)$ always - and because $y_p \in \{ \pm1 \}$.  Using these two characteristics notice how equation (4) can be rearranged by multiplying both sides by $y_p$ as 

\begin{equation}
\text{tanh}\left(y_p\left(w_0 + w_1x_p\right)\right) = 1
\end{equation}

Here the right hand side of the equality is one since we always have that $y_p^2 = 1$.  Now we can use the exponential definition of tanh which in general is

\begin{equation}
\text{tanh}(s) = 2\frac{1}{1 + e^{-2s}}-1
\end{equation}

to rewrite each desired equality equivalently as $1 + e^{-2y_p\left(w_0 + w_1x_p\right)} = 1$.  The factor of $2$ in the exponent merely rescales the weights, so we absorb it into $w_0$ and $w_1$ and write

\begin{equation}
1 + e^{-y_p\left(w_0 + w_1x_p\right)} = 1
\end{equation}

From here we could either subtract one off of both sides to give equivalently

\begin{equation}
e^{-y_p\left(w_0 + w_1x_p\right)} = 0
\end{equation}

or take the log of both sides to give equivalently

\begin{equation}
\text{log}\left(1 + e^{-y_p\left(w_0 + w_1x_p\right)}\right) = 0.
\end{equation}

Both options provide an equivalent way of expressing the desired set of $P$ equalities implied by equation (4), both are nonnegative regardless of $p$, and both can be used to form convex cost functions for logistic regression (the former is often used in the context of *boosting*).  For example, summing over all $P$ of the latter expressions gives the commonly used *softmax* (also called the *log-loss*) cost function for logistic regression 

\begin{equation}
g(\mathbf{w}) = \sum_{p=1}^P\text{log}\left(1 + e^{-y_p\left(w_0 + w_1x_p\right)}\right)
\end{equation}

The softmax cost is *always convex* regardless of the dataset used - we will see this empirically in the examples below, and a mathematical proof that verifies this claim more generally is provided in the appendix of this post.  We displayed a particular instance of the cost surface in the right panel of Example 2 for the dataset first shown in Example 1.  Looking back at this surface plot we can see that it is indeed convex.  
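
A quick empirical sanity check of convexity (not a proof) is to test the midpoint inequality $g\left(\tfrac{1}{2}(\mathbf{u} + \mathbf{v})\right) \leq \tfrac{1}{2}\left(g(\mathbf{u}) + g(\mathbf{v})\right)$ on many random pairs of weights.  The sketch below does this on a hypothetical toy dataset, and evaluates the cost with NumPy's `np.logaddexp`, a swapped-in numerically stable way of computing $\text{log}\left(1 + e^{-s}\right)$ rather than the loop used elsewhere in this post.

```python
import numpy as np

# Hypothetical toy dataset (an assumption, not the post's data).
x = np.array([-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0])
y = np.sign(x)

def softmax_cost(w):
    margins = y * (w[0] + w[1] * x)
    # log(1 + exp(-m)) written as logaddexp(0, -m) to avoid overflow for very negative m
    return float(np.sum(np.logaddexp(0.0, -margins)))

rng = np.random.default_rng(0)
violations = 0
for _ in range(1000):
    u, v = rng.normal(scale=10.0, size=(2, 2))   # two random weight vectors
    midpoint_value = softmax_cost(0.5 * (u + v))
    chord_value = 0.5 * (softmax_cost(u) + softmax_cost(v))
    if midpoint_value > chord_value + 1e-9:      # small tolerance for round-off
        violations += 1

print("midpoint-convexity violations found:", violations)
```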

Since the cost function is convex, the unnormalized version of gradient descent is more convenient to use here (as is the case whenever a function is convex, as was detailed in the post on gradient descent in our series on *mathematical optimization*).  This can make management of the steplength parameter $\alpha$ a simpler ordeal in practice since the length of each step is also controlled by the magnitude of the gradient.  Convexity also allows us to immediately apply Newton's method to logistic regression, which can provide rapid convergence to a global minimum.  Furthermore, for a convex function like this both local methods can be guaranteed mathematically to - when implemented correctly - converge to a global minimum.

For these reasons - as well as because the traditional probabilistic argument for deriving logistic regression leads to a cost function (called the *cross-entropy* cost) that is equivalent to the softmax cost above - the softmax cost is used more often in practice for logistic regression than is the tanh Least Squares cost.

> The softmax cost is more widely used in practice for logistic regression than is the tanh Least Squares cost.  Because it is convex we can use e.g., Newton's method to minimize the softmax cost, and we have the added confidence of knowing that local methods (gradient descent and Newton's method) are assured to converge to its global minima.

Note however that while we see less zig-zagging behavior when applying (unnormalized) gradient descent to the softmax cost, convergence of gradient descent can still be slow due to the scaling of the length of each step by the magnitude of the gradient (which vanishes as we approach a global minimum).  In some cases this can nullify the improved behavior of gradient descent (i.e., less zig-zagging as compared with the tanh Least Squares) with the softmax cost.
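
The unnormalized gradient descent step for the softmax cost can be sketched as follows.  This is an illustration only: the dataset is a hypothetical toy example, the gradient is derived by hand (differentiating $\text{log}\left(1 + e^{-m}\right)$ gives $-\sigma(-m)$ with $\sigma$ the logistic sigmoid), and the steplength $\alpha = 0.1$ is an assumption chosen small enough for this particular dataset, not the value used in the demos.

```python
import numpy as np

# Hypothetical toy dataset (an assumption, not the post's data).
x = np.array([-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0])
y = np.sign(x)

def softmax_cost(w):
    return float(np.sum(np.logaddexp(0.0, -y * (w[0] + w[1] * x))))

def softmax_gradient(w):
    margins = y * (w[0] + w[1] * x)
    # sigma(-m) = 1 / (1 + exp(m)), written via logaddexp for numerical stability
    sigma_neg = np.exp(-np.logaddexp(0.0, margins))
    common = -y * sigma_neg
    return np.array([np.sum(common), np.sum(common * x)])

def gradient_descent(w_init, alpha, max_its):
    w = np.array(w_init, dtype=float)
    cost_history = [softmax_cost(w)]
    for _ in range(max_its):
        # unnormalized step: its length shrinks along with the gradient magnitude
        w = w - alpha * softmax_gradient(w)
        cost_history.append(softmax_cost(w))
    return w, cost_history

w_final, cost_history = gradient_descent([3.0, 3.0], alpha=0.1, max_its=25)
print("initial cost:", cost_history[0])
print("final cost:  ", cost_history[-1])
```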

#### <span style="color:#a50e3e;">Example 4: </span> Using unnormalized gradient descent to perform logistic regression using the softmax cost

In this example we repeat the experiments of Example 3 using the softmax cost and unnormalized gradient descent.  First, we write out this cost function in Python in the next cell.  Once again the translation is fairly straightforward.

In [11]:
# define the input and output of our dataset
x = data[:,0]
x.shape = (len(x),1)
y = data[:,1]
y.shape = (len(y),1)

# the convex softmax cost function
def softmax(w):
    cost = 0
    for p in range(0,len(y)):
        x_p = x[p,:]
        y_p = y[p]
        cost += np.log(1 + np.exp(-y_p*(w[0] + w[1]*x_p)))
    return cost

With our cost function defined in Python we can now run our demonstration.  We initialize at the point $\mathbf{w} = \begin{bmatrix} 3 \\ 3 \end{bmatrix}$, set $\alpha = 1$, and run for 25 steps.  Each step of the run is animated - along with its corresponding fit to the data.  In the left panel both the data and the fit at each step (colored green to red as the run continues) are shown, while in the right panel the contour of the cost function is shown with each step marked (again colored green to red as the run continues).  Moving the slider from left to right animates the run from start to finish.  Note how we still have some zig-zagging behavior here, but since we can safely use unnormalized gradient descent its oscillations rapidly decrease in magnitude, since the length of each step is directly controlled by the magnitude of the gradient. 

In [29]:
# create instance of logistic regression demo and load in data
demo3 = superlearn.logistic_regression_2d_demos.visualizer(data)

# setup algo, initial point, and max number of iterations
demo3.run_algo(g = softmax,algo = 'gradient_descent',version = 'unnormalized',w_init = [3,3], max_its = 25,alpha = 1)

# animate descent process
demo3.animate_it_2d(num_contours = 25,viewmax = 6)

Running the algorithm until convergence here we can indeed reach the minimum of the cost function using gradient descent.  In the next Python cell we run the same algorithm for 900 steps from the initial point $\mathbf{w} = \begin{bmatrix} 20 \\ -20 \end{bmatrix}$, which is even farther away from the global minimum.  Indeed by the end of the run we reach a point quite close to a global minimum of the function.

The resulting fit here is comparable to the one found in the previous example, where we used the same initial point, steplength parameter $\alpha$, and number of steps to minimize the tanh Least Squares cost (using the normalized form of gradient descent).  Here the fact that the magnitude of the gradient controls the length of each step means that steps near the solution become very small, so a large number of them is required to converge.

In [35]:
# create instance of logistic regression demo and load in data
demo3 = superlearn.logistic_regression_2d_demos.visualizer(data)

# setup algo, initial point, and max number of iterations
demo3.run_algo(g = softmax,algo = 'gradient_descent',version = 'unnormalized',w_init = [20,-20], max_its = 900,alpha = 1)

# animate descent process
demo3.static_it_2d(num_contours = 25,viewmax = 30)

#### <span style="color:#a50e3e;">Example 5: </span> Using Newton's method to perform logistic regression using the softmax cost

Newton's method can also be used to minimize the softmax cost, taking just a few steps to reach a point near a global minimum when properly initialized.  In this example we illustrate the progress of 5 Newton steps beginning at the point $\mathbf{w} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$.  Within 5 steps we have reached a point providing a very good fit to the data.  With a cost function like the softmax we must be careful not to take too many steps, as the global minimum technically lies at infinity (this is where the tanh function perfectly resembles the sign function) and taking too many large steps towards it can lead to numerical issues as $\mathbf{w}$ becomes extremely large.
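
A Newton step for the softmax cost can be sketched with the gradient and Hessian written out by hand.  Everything here is an illustrative assumption rather than the demo library's implementation: the toy dataset, the function names, and in particular the small multiple of the identity added to the Hessian, which is one common way to keep the linear solve well conditioned (whether the `epsilon` argument in the cell below plays this role is not confirmed by this post).

```python
import numpy as np

# Hypothetical toy dataset (an assumption, not the post's data).
x = np.array([-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0])
y = np.sign(x)
X = np.column_stack([np.ones_like(x), x])   # each row is [1, x_p]

def softmax_cost(w):
    return float(np.sum(np.logaddexp(0.0, -y * (X @ w))))

def newton_step(w, epsilon=1e-5):
    margins = y * (X @ w)
    sigma_neg = np.exp(-np.logaddexp(0.0, margins))   # sigma(-m)
    sigma_pos = 1.0 - sigma_neg                       # sigma(m)
    gradient = X.T @ (-y * sigma_neg)
    # Hessian: sum over p of sigma(m_p) * sigma(-m_p) * [1, x_p][1, x_p]^T
    hessian = X.T @ (X * (sigma_pos * sigma_neg)[:, None])
    hessian = hessian + epsilon * np.eye(len(w))      # assumed small regularizer
    return w - np.linalg.solve(hessian, gradient)

w = np.array([1.0, 1.0])
cost_history = [softmax_cost(w)]
for _ in range(5):
    w = newton_step(w)
    cost_history.append(softmax_cost(w))

print("cost after each Newton step:", np.round(cost_history, 4))
print("weights after 5 steps:", w)
```

On perfectly separable data like this toy example the weights keep growing from step to step, which is the behavior the warning above about taking too many steps refers to.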

In [19]:
# create instance of logistic regression demo and load in data
demo3 = superlearn.logistic_regression_2d_demos.visualizer(data)

# setup algo, initial point, and max number of iterations
demo3.run_algo(g = softmax,algo = 'newtons_method',w_init = [1,1], max_its = 5,epsilon = 10**-5)

# animate descent process
demo3.static_it_2d(num_contours = 25,viewmax = 30)

---

Using the same logic that led us here to the softmax cost in the single-input case ($N = 1$) we can derive the same set of desired equalities more generally with $N$ inputs  

\begin{equation}
\text{log}\left( 1 + e^{-y_p\left(w_{0}+ x_{1,p}w_{1} + x_{2,p}w_{2} + \cdots + x_{N,p}w_{N}\right)} \right)  = 0
\end{equation}

Using the more compact vector notation we can write the above via the *inner product* of the vectors

\begin{equation}
\mathbf{w}=\left[\begin{array}{c}
w_{0}\\
w_{1}\\
\vdots\\
w_{N}
\end{array}\right]
\,\,\,\,\,\,\,\,\,\,\,\,\,\,
\mathbf{x}_p=\left[\begin{array}{c}
1\\
x_{1,p}\\
\vdots\\
x_{N,p}
\end{array}\right]
\end{equation}

as

\begin{equation}
\text{log}\left(1 + e^{-y_p^{\,}\mathbf{x}_p^T \mathbf{w}_{\,}^{\,}}\right)  = 0.
\end{equation}

And with this we can write the softmax cost function for logistic regression for general $N$ compactly as 




\begin{equation}
g(\mathbf{w}) = \sum_{p=1}^P\text{log}\left(1 + e^{-y_p^{\,}\mathbf{x}_p^T \mathbf{w}_{\,}^{\,}}\right).
\end{equation}
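
The compact form above translates naturally into vectorized code.  The sketch below is one possible way to write it in plain NumPy, assuming the inputs are stored as a $P \times N$ array; the helper names and the randomly generated data are hypothetical, and the finite-difference comparison is only there to check the hand-derived gradient.

```python
import numpy as np

def augment(X):
    """Prepend a column of ones so that each row is [1, x_1, ..., x_N]."""
    return np.column_stack([np.ones(X.shape[0]), X])

def softmax_cost(w, X_aug, y):
    # sum over p of log(1 + exp(-y_p * x_p^T w)), in a numerically stable form
    return float(np.sum(np.logaddexp(0.0, -y * (X_aug @ w))))

def softmax_gradient(w, X_aug, y):
    margins = y * (X_aug @ w)
    sigma_neg = np.exp(-np.logaddexp(0.0, margins))   # 1 / (1 + exp(m))
    return X_aug.T @ (-y * sigma_neg)

# Hypothetical random data with N = 2 inputs (an assumption, not the post's dataset).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)
X_aug = augment(X)
w = rng.normal(size=3)

# Central finite differences as an independent check of the gradient.
step = 1e-6
numeric = np.array([
    (softmax_cost(w + step * e, X_aug, y) - softmax_cost(w - step * e, X_aug, y)) / (2 * step)
    for e in np.eye(3)
])
analytic = softmax_gradient(w, X_aug, y)
print("largest gradient discrepancy:", np.max(np.abs(analytic - numeric)))
```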

#### <span style="color:#a50e3e;">Example 6: </span>  An example with $N = 2$ inputs

In this example we will minimize the softmax cost to perform logistic regression on the following two-class dataset.  This multi-input ($N = 2$) dataset has 100 points whose two classes cannot be perfectly separated by any line (and whose points cannot be perfectly fit using a tanh or even sign function).  In the left panel we show the data in three dimensions, and in the right panel we show the data from above in the 'separator' view.  

In [13]:
# load in dataset
data = np.loadtxt('2d_classification_data_v2.csv',delimiter = ',')

# create instance of linear regression demo, used below and in the next examples
demo4 = superlearn.logistic_regression_3d_demos.visualizer(data)

# plot data
demo4.plot_data(view = [15,-140])

As you can see, there are three points here (one blue point and two red points) that look like they are on 'the wrong side'.  Other than these 'noisy' points the two classes look like they can be very well fit with our nonlinear regressor $\text{tanh}\left(w_0 + w_1x_1 + w_2x_2\right)$ (in the left panel) and, equivalently, separated by a line $w_0 + w_1x_1 + w_2x_2 = 0$ (in the right panel), provided the weights are tuned properly.

> In the context of classification a 'noisy' point is one that has an incorrect label.

In the next Python cell we run Newton's method for 10 iterations (starting at a random initialization) on this dataset using the softmax cost function.  In the left panel the dataset is shown in three dimensions, along with the tanh surface fit.  We also draw the separator or *decision boundary* in the plane $y = 0$.  This plane shows the linear separator (where the tanh surface is zero) and is colored to indicate which inputs are determined to be of class $-1$ (the region of space colored light red) and $+1$ (the region of space colored light blue).  In the right panel we plot the data, separator, and classified regions colored in the same way from the 'separator' view.  Here we can easily identify three points that have been misclassified - our three 'on the wrong side' points (the one blue point in the red region, and the two red points in the blue region).

In [11]:
# setup algo, initial point, and max number of iterations
demo4.run_algo(algo = 'newtons_method',w_init = np.random.randn(3,1), max_its = 10)

# draw the final results
demo4.static_it_3d(view = [15,-140])

## 1.5  Classification nomenclature, predicting new labels, and accuracy

> Regardless of the cost function we use to perform logistic regression (e.g., tanh Least Squares or softmax cost), once the weights have been tuned the resulting model is generally referred to as a *trained classifier*. 

Moreover, remember that the general goal with logistic regression is to learn a set of $N+1$ parameters $\mathbf{w}$ (a bias $w_0$ plus one weight per input) so that the nonlinear logistic function

\begin{equation}
f(\mathbf{x}) = \text{tanh}\left(\mathbf{w}^T\mathbf{x}^{\,}\right)
\end{equation}

fits the given dataset as best as possible (this was as close as we could get to fitting the discontinuous step function  $\text{sign}\left(\mathbf{w}^T\mathbf{x}^{\,}\right)$ directly to the data).  Importantly, this function (sometimes called a *decision function*) provides us with a learned linear *separator* or *decision boundary* where it intersects the plane $y = 0$.  This is where

\begin{equation}
\mathbf{w}^T\mathbf{x}^{\,} = 0.
\end{equation}

This boundary is used to make predictions about every point in the space.  If a point $\mathbf{x}$ lies above it (i.e., if $\mathbf{w}^T\mathbf{x}^{\,} > 0$) then this point has a predicted label $+1$, likewise if the point lies below (i.e., if $\mathbf{w}^T\mathbf{x}^{\,} < 0$) the point has a predicted label $-1$.  If a point lies precisely on the boundary we cannot say which class it belongs to (in such cases a label is typically assigned to the point at random).  Wrapping together these conditions we can say that to predict the label of an input point $\mathbf{x}$ we evaluate its weighted version using the sign function.  Denoting the predicted label $y_{\text{predict}}$ we have

\begin{equation}
\text{sign}(\mathbf{w}^T\mathbf{x}^{\,}) = y_{\text{predict}}
\end{equation}

> Once weights are properly tuned the *decision boundary* $\mathbf{w}^T\mathbf{x}^{\,} = 0$ provides a rule for determining which points - according to the classifier - belong to class $+1$ or $-1$.  This is done by  computing $\text{sign}(\mathbf{w}^T\mathbf{x}^{\,})$.
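
The prediction rule above, together with a simple accuracy measure (the fraction of points whose predicted label matches the true label), can be sketched as follows.  The weights, the four data points, and the choice to send points lying exactly on the boundary to class $+1$ (rather than assigning them at random) are all assumptions made for illustration.

```python
import numpy as np

def predict_labels(w, X):
    """Evaluate sign(w^T x) for each row of X, after prepending the constant 1."""
    X_aug = np.column_stack([np.ones(X.shape[0]), X])
    scores = X_aug @ w
    # Assumption: points exactly on the boundary are assigned to class +1.
    return np.where(scores >= 0, 1.0, -1.0)

def accuracy(w, X, y):
    """Fraction of points whose predicted label equals the true label."""
    return float(np.mean(predict_labels(w, X) == y))

# Hypothetical weights: the boundary is the line x_1 - x_2 = 0.
w = np.array([0.0, 1.0, -1.0])
X = np.array([[2.0, 1.0], [1.0, 2.0], [3.0, 0.0], [0.0, 3.0]])
y = np.array([1.0, -1.0, 1.0, 1.0])     # the last point is on the 'wrong side'

predicted = predict_labels(w, X)
num_misclassified = int(np.sum(predicted != y))
print("predicted labels:", predicted)
print("misclassified:", num_misclassified, "accuracy:", accuracy(w, X, y))
```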

This tells us both how to evaluate the efficacy of any set of weights $\mathbf{w}$ for a dataset of $P$ training points - we simply check the 

## 1.6 Appendix

### Proof that the softmax cost function is convex

### Computation of the Lipschitz constant 