***

*Course:* [Math 535](https://people.math.wisc.edu/~roch/mmids/) - Mathematical Methods in Data Science (MMiDS)  
*Chapter:* 6-Optimization theory and algorithms  
*Author:* [Sebastien Roch](https://people.math.wisc.edu/~roch/), Department of Mathematics, University of Wisconsin-Madison  
*Updated:* May 26, 2024   
*Copyright:* &copy; 2024 Sebastien Roch

***

In [None]:
# You will need the files:
#     * mmids.py
#     * customer_airline_satisfaction.csv
#     * advertising.csv 
#     * SAHeart.csv 
# from https://github.com/MMiDS-textbook/MMiDS-textbook.github.io/tree/main/utils
#
# IF RUNNING ON GOOGLE COLAB (RECOMMENDED):
# "Upload to session storage" from the Files tab on the left
# Alternative instructions: https://colab.research.google.com/notebooks/io.ipynb

In [None]:
# PYTHON 3
import numpy as np
from numpy import linalg as LA
import matplotlib.pyplot as plt
import pandas as pd
import networkx as nx
import torch
import mmids

## Motivating example:  analyzing customer satisfaction

**Figure:** Helpful map of ML by scikit-learn ([Source](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html))

![ml-cheat-sheet](https://scikit-learn.org/1.4/_static/ml_map.png)

$\bowtie$

We now turn to classification.

Quoting [Wikipedia](https://en.wikipedia.org/wiki/Statistical_classification):

> In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. Examples are assigning a given email to the "spam" or "non-spam" class, and assigning a diagnosis to a given patient based on observed characteristics of the patient (sex, blood pressure, presence or absence of certain symptoms, etc.). Classification is an example of pattern recognition. In the terminology of machine learning, classification is considered an instance of supervised learning, i.e., learning where a training set of correctly identified observations is available.

We will illustrate this problem on an [airline customer satisfaction](https://www.kaggle.com/datasets/sjleshrac/airlines-customer-satisfaction) dataset available on [Kaggle](https://www.kaggle.com), an excellent source of data and community contributed analyses. The background is the following: 

> The dataset consists of the details of customers who have already flown with them. The feedback of the customers on various context and their flight data has been consolidated. The main purpose of this dataset is to predict whether a future customer would be satisfied with their service given the details of the other parameters values.

We first load the data and convert it to an appropriate matrix representation. We (or, more precisely, ChatGPT) pre-processed the original file to remove rows with missing data or 0 ratings, convert categorical variables into one-hot encodings, and keep only a subset of the rows and columns. You can see the details of the pre-processing in this [chat history](https://chatgpt.com/share/c5070b9c-f33f-4a37-a793-fde0d7cb7b06).

In [None]:
data = pd.read_csv('customer_airline_satisfaction.csv')
data.head()

It has 24 columns and the number of rows is:

In [None]:
len(data.index)

The column names are:

In [None]:
print(data.columns.tolist())

The first column indicates whether a customer was satisfied (with `1` meaning satisfied). The next 6 columns give some information about the customers, e.g., their age or whether they are members of a loyalty program with the airline. The following three columns give information about the flight, with names that should be self-explanatory: `Flight Distance`, `Departure Delay in Minutes`, and `Arrival Delay in Minutes`. The remaining columns give the customers' ratings, between `1` and `5`, of various features, e.g., `Baggage handling`, `Checkin service`.

Our goal will be to predict the first column, `Satisfied`, from the rest of the columns. For this, we transform our data into Numpy arrays.

In [None]:
y = data['Satisfied'].to_numpy()
X = data.drop(columns=['Satisfied']).to_numpy()

In [None]:
print('y=',y)
print('X=',X)

Some features may affect satisfaction more than others. Let us look at age for instance. The following code extracts the `Age` column from `X` (i.e., column $0$) and computes the fraction of satisfied customers in several age bins.

Explanation by ChatGPT (who wrote the code):

1. [`numpy.digitize`](https://numpy.org/doc/stable/reference/generated/numpy.digitize.html) bins the age data into the specified age bins. The `-1` adjustment is to match zero-based indexing.
2. [`numpy.bincount`](https://numpy.org/doc/stable/reference/generated/numpy.bincount.html) counts the occurrences of each bin index. The `minlength` parameter ensures that the resulting array length matches the number of age bins (`age_labels`). This is important if some bins have zero counts, ensuring the counts array covers all bins.
3. `freq_satisfied = counts_satisfied / counts_all` calculates the satisfaction frequency for each age group by dividing the counts of satisfied customers by the total counts in each age group.
4. The results are plotted using matplotlib's [`matplotlib.pyplot.bar`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.bar.html) function.
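
To make the binning steps above concrete, here is a minimal sketch on a made-up toy dataset. The arrays `toy_age` and `toy_y` and the bin edges `toy_bins` are hypothetical and chosen only for illustration; they are not part of the airline data.

```python
import numpy as np

# Toy data (made up for illustration): five ages and their 0/1 satisfaction labels
toy_age = np.array([16, 22, 22, 40, 70])
toy_y = np.array([0, 1, 0, 1, 1])

# Three bins: [0, 18), [18, 35), [35, 100)
toy_bins = [0, 18, 35, 100]
n_bins = len(toy_bins) - 1

# np.digitize returns i such that bins[i-1] <= x < bins[i];
# subtracting 1 gives zero-based bin indices
toy_idx = np.digitize(toy_age, bins=toy_bins) - 1

# Count all customers, and satisfied customers, in each bin
toy_counts_all = np.bincount(toy_idx, minlength=n_bins)
toy_counts_sat = np.bincount(toy_idx[toy_y == 1], minlength=n_bins)

# Fraction of satisfied customers per bin
toy_freq = toy_counts_sat / toy_counts_all

print(toy_idx)
print(toy_freq)
```

Note that a bin containing no customers would lead to a division by zero in the last step, so in practice one may want to check `counts_all` before dividing.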

We see in particular that younger people tend to be more dissatisfied. Of course, this might be because they cannot afford the most expensive services.

In [None]:
# Extract the 'Age' column (index 0 in the array)
age_col_index = 0
age_data = X[:, age_col_index]

# Define the age bins and labels
age_bins = [0, 18, 25, 35, 45, 55, 65, 100]
age_labels = ['0-17', '18-24', '25-34', '35-44', '45-54', '55-64', '65+']

# Use np.digitize to bin the age data
age_bin_indices = np.digitize(age_data, bins=age_bins) - 1

# Use np.bincount to count occurrences
counts_all = np.bincount(age_bin_indices, minlength=len(age_labels))
counts_satisfied = np.bincount(age_bin_indices[y == 1], minlength=len(age_labels))

# Calculate the frequencies
freq_satisfied = counts_satisfied / counts_all

# Prepare data for plotting
age_group_labels = np.array(age_labels)

In [None]:
plt.figure(figsize=(12, 8))
plt.bar(age_group_labels, freq_satisfied, color='blue', alpha=0.7)
plt.title('Frequency of Satisfaction by Age Group')
plt.xlabel('Age Group')
plt.ylabel('Frequency of Satisfied Customers')
plt.grid(True)
plt.show()

The input data is now of the form $\{(\mathbf{x}_i, y_i) : i=1,\ldots, n\}$ where $\mathbf{x}_i \in \mathbb{R}^d$ are the features and $y_i \in \{0,1\}$ is the label. Above we use the matrix representation $X \in \mathbb{R}^{n \times d}$ with rows $\mathbf{x}_i^T$, $i = 1,\ldots, n$, and $\mathbf{y} = (y_1, \ldots, y_n)^T \in \{0,1\}^n$.

Our goal: 

> learn a classifier from the examples $\{(\mathbf{x}_i, y_i) : i=1,\ldots, n\}$, that is, a function $\hat{f} : \mathbb{R}^d \to \mathbb{R}$ such that $\hat{f}(\mathbf{x}_i) \approx y_i$.

We may want to enforce that the output is in $\{0,1\}$ as well. This problem is referred to as [binary classification](https://en.wikipedia.org/wiki/Binary_classification).

A natural approach to this type of [supervised learning](https://en.wikipedia.org/wiki/Supervised_learning) problem is to define two objects:

1. **Family of classifiers:** A class $\widehat{\mathcal{F}}$ of classifiers from which to pick $\hat{f}$.

2. **Loss function:** A loss function $\ell(\hat{f}, (\mathbf{x},y))$ which quantifies how good of a fit $\hat{f}(\mathbf{x})$ is to $y$.

Our goal is then to solve

$$
\min_{\hat{f} \in \widehat{\mathcal{F}}} \frac{1}{n} \sum_{i=1}^n \ell(\hat{f}, (\mathbf{x}_i, y_i)),
$$

that is, we seek to find a classifier among $\widehat{\mathcal{F}}$ that minimizes the average loss over the examples.

For instance, in [logistic regression](https://en.wikipedia.org/wiki/Logistic_regression), we consider linear classifiers of the form

$$
\hat{f}(\mathbf{x})
= \sigma(\mathbf{x}^T \boldsymbol{\theta})
\qquad
\text{with}
\qquad
\sigma(t) = \frac{1}{1 + e^{-t}}
$$

where $\boldsymbol{\theta} \in \mathbb{R}^d$ is a parameter vector. And we use the [cross-entropy loss](https://en.wikipedia.org/wiki/Cross_entropy#Cross-entropy_loss_function_and_logistic_regression)

$$
\ell(\hat{f}, (\mathbf{x}, y))
= -  y \log(\sigma(\mathbf{x}^T \boldsymbol{\theta}))
- (1-y) \log(1- \sigma(\mathbf{x}^T \boldsymbol{\theta})).
$$

In parametric form, the problem boils down to

$$
\min_{\boldsymbol{\theta} \in \mathbb{R}^d}
- \frac{1}{n} \sum_{i=1}^n y_i \log(\sigma(\mathbf{x}_i^T \boldsymbol{\theta}))
- \frac{1}{n} \sum_{i=1}^n (1-y_i) \log(1- \sigma(\mathbf{x}_i^T \boldsymbol{\theta})).
$$

To obtain a prediction in $\{0,1\}$ here, we could cut off $\hat{f}(\mathbf{x})$ at a threshold $\tau \in [0,1]$, that is, return $\mathbf{1}\{\hat{f}(\mathbf{x}) > \tau\}$.

We will explain in a later chapter where this choice comes from.
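
As a rough illustration of the logistic regression formulation above, the following sketch evaluates the classifier, the average cross-entropy loss and the thresholded prediction on a tiny made-up dataset. The names `cross_entropy`, `predict`, `X_toy`, `y_toy` and `theta_toy` are hypothetical, and we assume the data matrix stores one example per row.

```python
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def cross_entropy(theta, X, y):
    # X has one example per row, so X @ theta stacks the values x_i^T theta
    p = sigmoid(X @ theta)
    return np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p))

def predict(theta, X, tau=0.5):
    # Cut off the classifier output at the threshold tau
    return (sigmoid(X @ theta) > tau).astype(int)

# Made-up data: three examples with two features each
X_toy = np.array([[1.0, -2.0], [1.0, 0.5], [1.0, 3.0]])
y_toy = np.array([0, 1, 1])
theta_toy = np.array([0.0, 1.0])

print(cross_entropy(theta_toy, X_toy, y_toy))
print(predict(theta_toy, X_toy))
```

When $\boldsymbol{\theta} = \mathbf{0}$, every example is assigned probability $1/2$ and the average loss equals $\log 2$, which gives a simple baseline to compare against.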

The purpose of this chapter is to develop some of the mathematical theory and algorithms needed to solve this type of optimization formulation.

## Background: review of differentiable functions of several variables

$\newcommand{\bSigma}{\boldsymbol{\Sigma}}$ $\newcommand{\bmu}{\boldsymbol{\mu}}$ $\newcommand{\blambda}{\boldsymbol{\lambda}}$

## Optimality conditions

**EXAMPLE:** Consider $f(x) = e^x$. Then $f'(x) = f''(x) = e^x$. Suppose we are interested in approximating $f$ in the interval $[0,1]$. We take $a=0$ and $b=1$ in *Taylor's Theorem*. The linear term is 

$$
f(a) + (x-a) f'(a) = 1 + x e^0 = 1 + x.
$$

Then for any $x \in [0,1]$

$$
f(x) = 1 + x + \frac{1}{2}x^2 e^{\xi_x}
$$

where $\xi_x \in (0,1)$ depends on $x$. We get a uniform bound on the error over $[0,1]$ by replacing $\xi_x$ with its worst possible value over $[0,1]$ 

$$
|f(x) - (1+x)| \leq \frac{1}{2}x^2 e^{\xi_x} \leq \frac{e}{2} x^2.
$$

In [None]:
x = np.linspace(0,1,100)
y = np.exp(x)
taylor = 1 + x
err = (np.exp(1)/2) * x**2

In [None]:
plt.plot(x,y,label='f')
plt.plot(x,taylor,label='taylor')
plt.legend()
plt.show()

If we plot the upper and lower bounds, we see that $f$ indeed falls within them.

In [None]:
plt.plot(x,y,label='f')
plt.plot(x,taylor,label='taylor')
plt.plot(x,taylor-err,linestyle=':',color='green',label='lower')
plt.plot(x,taylor+err,linestyle='--',color='green',label='upper')
plt.legend()
plt.show()

$\lhd$

**EXAMPLE:** Let $f(x) = x^3$. Then $f'(x) = 3 x^2$ and $f''(x) = 6 x$ so that $f'(0) = 0$ and $f''(0) \geq 0$. Hence $x=0$ is a stationary point. But $x=0$ is not a local minimizer. Indeed $f(0) = 0$ but, for any $\delta > 0$, $f(-\delta) < 0$.

In [None]:
x = np.linspace(-2,2,100)
y = x**3

In [None]:
plt.plot(x,y)
plt.ylim(-5,5)
plt.show()

$\lhd$

**EXAMPLE:** If we want to minimize $2 x_1^2 + 3 x_2^2$ over all two-dimensional unit vectors $\mathbf{x} = (x_1, x_2)$, then we can let

$$
f(\mathbf{x}) = 2 x_1^2 + 3 x_2^2
$$

and

$$
h_1(\mathbf{x}) = 1 - x_1^2 - x_2^2 = 1 - \|\mathbf{x}\|^2.
$$

Observe that we could have chosen a different equality constraint to express the same minimization problem. $\lhd$

**EXAMPLE:** **(continued)** Returning to the previous example,

$$
\nabla f(\mathbf{x})
= \left(
\frac{\partial f(\mathbf{x})}{\partial x_1},
\frac{\partial f(\mathbf{x})}{\partial x_2}
\right)
= (4 x_1, 6 x_2)
$$

and

$$
\nabla h_1(\mathbf{x})
= \left(
\frac{\partial h_1(\mathbf{x})}{\partial x_1},
\frac{\partial h_1(\mathbf{x})}{\partial x_2}
\right)
= (- 2 x_1, - 2 x_2).
$$

The conditions in the theorem read

\begin{align*}
&4 x_1 - 2 \lambda_1 x_1  = 0\\
&6 x_2 - 2 \lambda_1 x_2  = 0.
\end{align*}

The constraint $x_1^2 + x_2^2 = 1$ must also be satisfied. Observe that the linear independence condition is automatically satisfied since there is only one constraint.

There are several cases to consider. 

1- If neither $x_1$ nor $x_2$ is $0$, then the first equation gives $\lambda_1 = 2$ while the second one gives $\lambda_1 = 3$. So that case cannot happen.

2- If $x_1 = 0$, then $x_2 = 1$ or $x_2 = -1$ by the constraint and the second equation gives $\lambda_1 = 3$ in either case.

3- If $x_2 = 0$, then $x_1 = 1$ or $x_1 = -1$ by the constraint and the first equation gives $\lambda_1 = 2$ in either case.

Do any of these four solutions, i.e., $(x_1,x_2,\lambda_1) = (0,1,3)$, $(x_1,x_2,\lambda_1) = (0,-1,3)$, $(x_1,x_2,\lambda_1) = (1,0,2)$ and $(x_1,x_2,\lambda_1) = (-1,0,2)$, actually correspond to a local minimizer?

This problem can be solved manually. Indeed, substitute $x_2^2 = 1 - x_1^2$ into the objective function to obtain 

$$
2 x_1^2 + 3(1 - x_1^2)
= -x_1^2 + 3.
$$

This is minimized for the largest value that $x_1^2$ can take, namely when $x_1 = 1$ or $x_1 = -1$. Indeed, we must have $0 \leq x_1^2 \leq x_1^2 + x_2^2 = 1$. So both $(x_1, x_2) = (1,0)$ and $(x_1, x_2) = (-1,0)$ are global minimizers. A fortiori, they must be local minimizers. 

What about $(x_1,x_2) = (0,1)$ and $(x_1,x_2) = (0,-1)$? Arguing as above, they in fact correspond to global *maximizers* of the objective function. $\lhd$
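
The conclusions of this example can also be checked numerically. The sketch below is only a sanity check, not a proof: it parametrizes the unit circle as $(\cos t, \sin t)$, so that the constraint holds automatically, and evaluates the objective on a grid of angles. The variable names and the grid size are arbitrary choices.

```python
import numpy as np

# Parametrize the unit circle as (cos t, sin t); the constraint holds automatically
t_grid = np.linspace(0, 2 * np.pi, 721)  # half-degree steps, includes multiples of pi/2
c1, c2 = np.cos(t_grid), np.sin(t_grid)

# Objective restricted to the circle
f_circle = 2 * c1**2 + 3 * c2**2

i_min, i_max = np.argmin(f_circle), np.argmax(f_circle)
print('min value', f_circle[i_min], 'at', (c1[i_min], c2[i_min]))
print('max value', f_circle[i_max], 'at', (c1[i_max], c2[i_max]))
```

Up to floating-point error, the smallest value is $2$, attained where $x_1 = \pm 1$, and the largest is $3$, attained where $x_2 = \pm 1$, in agreement with the analysis above.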

**EXAMPLE:** **(continued)** Returning to the previous example, the points satisfying $h_1(\mathbf{x}) = 0$ sit on the circle of radius $1$ around the origin. We have already seen that 

$$
\nabla h_1(\mathbf{x})
= \left(
\frac{\partial h_1(\mathbf{x})}{\partial x_1},
\frac{\partial h_1(\mathbf{x})}{\partial x_2}
\right)
= (- 2 x_1, - 2 x_2).
$$

Here is code plotting these (courtesy of ChatGPT 4). It uses [`numpy.meshgrid`](https://numpy.org/doc/stable/reference/generated/numpy.meshgrid.html) to generate a grid of points for $x_1$ and $x_2$, and [`matplotlib.pyplot.contour`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.contour.html) to plot the constraint set as a [contour line](https://en.wikipedia.org/wiki/Contour_line) (for the constant value $0$) of $h_1$. The gradients are plotted with the [`matplotlib.pyplot.quiver`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.quiver.html) function, which is used for plotting vectors as arrows. 

In [None]:
# Define the constraint function
def h1(x1, x2):
    return 1 - x1**2 - x2**2

# Generate a grid of points for x1 and x2
x1 = np.linspace(-1.5, 1.5, 400)
x2 = np.linspace(-1.5, 1.5, 400)
X1, X2 = np.meshgrid(x1, x2)

# Compute constraint function on grid
H1 = h1(X1, X2)

# Points on the constraint where the gradients will be plotted
points = [
    (0.5, np.sqrt(3)/2),
    (-0.5, np.sqrt(3)/2),
    (0.5, -np.sqrt(3)/2),
    (-0.5, -np.sqrt(3)/2),
    (1, 0),
    (-1, 0),
    (0, 1),
    (0, -1)
]

In [None]:
plt.figure(figsize=(8, 6))
plt.grid(True)
plt.axis('equal')

# Plot the constraint set where h1(x1, x2) = 0
plt.contour(X1, X2, H1, levels=[0], colors='blue')

# Plot gradients of h1 (red) at specified points
for x1, x2 in points:
    plt.quiver(x1, x2, -2*x1, -2*x2, scale=10, color='red')

In [None]:
plt.figure(figsize=(8, 6))
plt.grid(True)
plt.axis('equal')
plt.contour(X1, X2, H1, levels=[0], colors='blue')
for x1, x2 in points:
    plt.quiver(x1, x2, -x1/np.sqrt(x1**2 + x2**2), 
               -x2/np.sqrt(x1**2 + x2**2), 
               scale=10, color='red')
    plt.quiver(x1, x2, 4*x1/np.sqrt(16 * x1**2 + 36 * x2**2), 
               6*x2/np.sqrt(16 * x1**2 + 36 * x2**2), 
               scale=10, color='green')

We see that, at $(-1,0)$ and $(1,0)$, the gradient of $f$ (green) is parallel to the gradient of $h_1$ (red), and hence is indeed orthogonal to the first-order feasible directions. $\lhd$


$\newcommand{\bSigma}{\boldsymbol{\Sigma}}$ $\newcommand{\bmu}{\boldsymbol{\mu}}$ $\newcommand{\bsigma}{\boldsymbol{\sigma}}$

## Gradient descent and its convergence analysis

**NUMERICAL CORNER:** We implement gradient descent in Python. We assume that a function `f` and its gradient `grad_f` are provided. We first code the basic steepest descent step with a step size $\alpha =$ `alpha`.

In [None]:
def desc_update(grad_f, x, alpha):
    return x - alpha*grad_f(x)

In [None]:
def gd(f, grad_f, x0, alpha=1e-3, niters=int(1e6)):
    
    xk = x0
    for _ in range(niters):
        xk = desc_update(grad_f, xk, alpha)

    return xk, f(xk)

We illustrate on a simple example.

In [None]:
def f(x): 
    return (x-1)**2 + 10

In [None]:
xgrid = np.linspace(-5,5,100)
plt.plot(xgrid, f(xgrid))
plt.show()

In [None]:
def grad_f(x):
    return 2*(x-1)

In [None]:
gd(f, grad_f, 0)

We found a global minimizer in this case.

The next example shows that a different local minimizer may be reached depending on the starting point.

In [None]:
def f(x): 
    return 4 * (x-1)**2 * (x+1)**2 - 2*(x-1)

In [None]:
xgrid = np.linspace(-2,2,100)
plt.plot(xgrid, f(xgrid), label='f')
plt.ylim((-1,10))
plt.legend()
plt.show()

**CLICK ON TARGET:** If we start gradient descent from $-2$, where will it converge? $\ddagger$

In [None]:
def grad_f(x): 
    return 8 * (x-1) * (x+1)**2 + 8 * (x-1)**2 * (x+1) - 2

In [None]:
xgrid = np.linspace(-2,2,100)
plt.plot(xgrid, f(xgrid), label='f')
plt.plot(xgrid, grad_f(xgrid), label='grad_f')
plt.ylim((-10,10))
plt.legend()
plt.show()

In [None]:
gd(f, grad_f, 0)

In [None]:
gd(f, grad_f, -2)

In the final example, we end up at a stationary point that is not a local minimizer. Here both the first and second derivatives are zero. This is known as a [saddle point](https://en.wikipedia.org/wiki/Saddle_point).

In [None]:
def f(x):
    return x**3

In [None]:
xgrid = np.linspace(-2,2,100)
plt.plot(xgrid, f(xgrid), label='f')
plt.ylim((-10,10))
plt.legend()
plt.show()

In [None]:
def grad_f(x):
    return 3 * x**2

In [None]:
xgrid = np.linspace(-2,2,100)
plt.plot(xgrid, f(xgrid), label='f')
plt.plot(xgrid, grad_f(xgrid), label='grad_f')
plt.ylim((-10,10))
plt.legend()
plt.show()

In [None]:
gd(f, grad_f, 2)

In [None]:
gd(f, grad_f, -2, niters=100)

$\unlhd$

**NUMERICAL CORNER:** We revisit our first simple single-variable example.

In [None]:
def f(x): 
    return (x-1)**2 + 10

In [None]:
xgrid = np.linspace(-5,5,100)
plt.plot(xgrid, f(xgrid))
plt.show()

Recall that the first derivative is:

In [None]:
def grad_f(x):
    return 2*(x-1)

So the second derivative is $f''(x) = 2$. Hence, this $f$ is $L$-smooth and $m$-strongly convex with $L = m = 2$. The theory we developed suggests taking step size $\alpha_t = \alpha = 1/L = 1/2$. It also implies that

$$
f(x^1) - f(x^*)
\leq \left(1 - \frac{m}{L}\right) [f(x^0) - f(x^*)]
= 0.
$$

We converge in one step! And that holds for any starting point $x^0$.

Let's try this!

In [None]:
gd(f, grad_f, 0, alpha=0.5, niters=1)

Let's try a different starting point.

In [None]:
gd(f, grad_f, 100, alpha=0.5, niters=1)

$\unlhd$


## Application to logistic regression

We return to logistic regression, which we alluded to in the motivating example of this chapter.

The input data is of the form $\{(\boldsymbol{\alpha}_i, b_i) : i=1,\ldots, n\}$ where $\boldsymbol{\alpha}_i = (\alpha_{i,1}, \ldots, \alpha_{i,d}) \in \mathbb{R}^d$ are the features and $b_i \in \{0,1\}$ is the label. As before we use a matrix representation: $A \in \mathbb{R}^{n \times d}$ has rows $\boldsymbol{\alpha}_i^T$, $i = 1,\ldots, n$ and $\mathbf{b} = (b_1, \ldots, b_n) \in \{0,1\}^n$.

We summarize the logistic regression approach. Our goal is to find a function of the features that approximates the probability of the label $1$. For this purpose, we model the [log-odds](https://en.wikipedia.org/wiki/Logit) (or logit function) of the probability of label $1$ as a linear function of the features $\boldsymbol{\alpha}  \in \mathbb{R}^d$

$$
\log \frac{p(\mathbf{x}; \boldsymbol{\alpha})}{1-p(\mathbf{x}; \boldsymbol{\alpha})}
= \boldsymbol{\alpha}^T \mathbf{x}
$$

where $\mathbf{x} \in \mathbb{R}^d$ is the vector of coefficients (i.e., parameters). Inverting this expression gives

$$
p(\mathbf{x}; \boldsymbol{\alpha})
= \sigma(\boldsymbol{\alpha}^T \mathbf{x})
$$

where the [sigmoid](https://en.wikipedia.org/wiki/Logistic_function) function is

$$
\sigma(z)
= \frac{1}{1 + e^{-z}}
$$

for $z \in \mathbb{R}$.

We plot the sigmoid function.

In [None]:
def sigmoid(z): 
    return 1/(1+np.exp(-z))

In [None]:
grid = np.linspace(-5, 5, 100)
plt.plot(grid,sigmoid(grid),'r')
plt.show()

We seek to maximize the probability of observing the data (also known as [likelihood function](https://en.wikipedia.org/wiki/Likelihood_function)) assuming the labels are independent given the features, which is given by

$$
\mathcal{L}(\mathbf{x}; A, \mathbf{b})
= \prod_{i=1}^n p(\mathbf{x}; \boldsymbol{\alpha}_i)^{b_i} 
(1- p(\mathbf{x}; \boldsymbol{\alpha}_i))^{1-b_i}.
$$

Taking a logarithm, multiplying by $-1/n$ and substituting the sigmoid function, we want to minimize the [cross-entropy loss](https://en.wikipedia.org/wiki/Cross_entropy#Cross-entropy_loss_function_and_logistic_regression)

$$
\ell(\mathbf{x}; A, \mathbf{b})
= \frac{1}{n} \sum_{i=1}^n \left\{- b_i \log(\sigma(\boldsymbol{\alpha}_i^T \mathbf{x}))
- (1-b_i) \log(1- \sigma(\boldsymbol{\alpha}_i^T \mathbf{x}))\right\}.
$$

We used standard properties of the logarithm: for $x, y > 0$, $\log(xy) = \log x + \log y$ and $\log(x^y) = y \log x$. 

Hence, we want to solve the minimization problem

$$
\min_{\mathbf{x} \in \mathbb{R}^d} \ell(\mathbf{x}; A, \mathbf{b}).
$$

We are implicitly using here that the logarithm is a strictly increasing function and therefore does not change the global maximum of a function. Multiplying by $-1$ changes the global maximum into a global minimum.

To use gradient descent, we need the gradient of $\ell$. We use the *Chain Rule* and first compute the derivative of $\sigma$ which is

$$
\sigma'(z)
= \frac{e^{-z}}{(1 + e^{-z})^2}
= \frac{1}{1 + e^{-z}}\left(1 - \frac{1}{1 + e^{-z}}\right)
= \sigma(z) (1 - \sigma(z)).
$$

The latter expression is known as the [logistic differential equation](https://en.wikipedia.org/wiki/Logistic_function#Logistic_differential_equation). It arises in a variety of applications, including the modeling of [population dynamics](https://towardsdatascience.com/covid-19-infection-in-italy-mathematical-models-and-predictions-7784b4d7dd8d). Here it will be a convenient way to compute the gradient. 
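
The identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ can be sanity-checked with a central finite difference. This is a minimal sketch; the step `h` and the grid `z_chk` are arbitrary choices made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z_chk = np.linspace(-5, 5, 11)
h = 1e-6  # finite-difference step (an arbitrary small value)

# Central difference approximation of the derivative
fd_deriv = (sigmoid(z_chk + h) - sigmoid(z_chk - h)) / (2 * h)

# Closed-form expression derived above
closed_form = sigmoid(z_chk) * (1 - sigmoid(z_chk))

print(np.max(np.abs(fd_deriv - closed_form)))
```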

Observe that, for $\boldsymbol{\alpha} = (\alpha_{1}, \ldots, \alpha_{d})  \in \mathbb{R}^d$, by the *Chain Rule*

$$
\nabla\sigma(\boldsymbol{\alpha}^T \mathbf{x})
= \sigma'(\boldsymbol{\alpha}^T \mathbf{x}) \nabla (\boldsymbol{\alpha}^T \mathbf{x})
= \sigma'(\boldsymbol{\alpha}^T \mathbf{x}) \boldsymbol{\alpha}
$$

where, throughout, the gradient is with respect to $\mathbf{x}$.

Alternatively, we can obtain the same formula by applying the single-variable *Chain Rule*

\begin{align*}
\frac{\partial}{\partial x_j} \sigma(\boldsymbol{\alpha}^T \mathbf{x})
&= \sigma'(\boldsymbol{\alpha}^T \mathbf{x}) \frac{\partial}{\partial x_j}(\boldsymbol{\alpha}^T \mathbf{x})\\
&= \sigma'(\boldsymbol{\alpha}^T \mathbf{x}) \frac{\partial}{\partial x_j}\left(\alpha_{j} x_{j} + \sum_{\ell=1, \ell \neq j}^d \alpha_{\ell} x_{\ell}\right)\\
&= \sigma(\boldsymbol{\alpha}^T \mathbf{x}) (1 - \sigma(\boldsymbol{\alpha}^T \mathbf{x}))\, \alpha_{j}
\end{align*}

so that

\begin{align*}
\nabla\sigma(\boldsymbol{\alpha}^T \mathbf{x})
&= \left(\sigma(\boldsymbol{\alpha}^T \mathbf{x}) (1 - \sigma(\boldsymbol{\alpha}^T \mathbf{x}))\, \alpha_{1}, \ldots, \sigma(\boldsymbol{\alpha}^T \mathbf{x}) (1 - \sigma(\boldsymbol{\alpha}^T \mathbf{x}))\, \alpha_{d}\right)\\
&= \sigma(\boldsymbol{\alpha}^T \mathbf{x}) (1 - \sigma(\boldsymbol{\alpha}^T \mathbf{x}))\, (\alpha_{1}, \ldots, \alpha_{d})\\
&= \sigma(\boldsymbol{\alpha}^T \mathbf{x}) (1 - \sigma(\boldsymbol{\alpha}^T \mathbf{x}))\, \boldsymbol{\alpha}.
\end{align*}


By another application of the *Chain Rule*, since $\frac{\mathrm{d}}{\mathrm{d} z} \log z = \frac{1}{z}$,

\begin{align*}
&\nabla\ell(\mathbf{x}; A, \mathbf{b})\\
&= \nabla\left[\frac{1}{n} \sum_{i=1}^n \left\{- b_i \log(\sigma(\boldsymbol{\alpha}_i^T \mathbf{x}))
- (1-b_i) \log(1- \sigma(\boldsymbol{\alpha}_i^T \mathbf{x}))\right\}\right]\\
&= - \frac{1}{n} \sum_{i=1}^n \frac{b_i}{\sigma(\boldsymbol{\alpha}_i^T \mathbf{x})} \nabla\sigma(\boldsymbol{\alpha}_i^T \mathbf{x})
- \frac{1}{n} \sum_{i=1}^n \frac{1-b_i}{1- \sigma(\boldsymbol{\alpha}_i^T \mathbf{x})} \nabla(1 - \sigma(\boldsymbol{\alpha}_i^T \mathbf{x}))\\
&= - \frac{1}{n} \sum_{i=1}^n \frac{b_i}{\sigma(\boldsymbol{\alpha}_i^T \mathbf{x})} \nabla\sigma(\boldsymbol{\alpha}_i^T \mathbf{x})
+ \frac{1}{n} \sum_{i=1}^n \frac{1-b_i}{1- \sigma(\boldsymbol{\alpha}_i^T \mathbf{x})} \nabla\sigma(\boldsymbol{\alpha}_i^T \mathbf{x}).
\end{align*}

Using the expression for the gradient of the sigmoid function, this is

\begin{align*}
&= - \frac{1}{n} \sum_{i=1}^n \frac{b_i}{\sigma(\boldsymbol{\alpha}_i^T \mathbf{x})} \sigma(\boldsymbol{\alpha}_i^T \mathbf{x}) (1 - \sigma(\boldsymbol{\alpha}_i^T \mathbf{x})) \,\boldsymbol{\alpha}_i\\
&\quad\quad + \frac{1}{n} \sum_{i=1}^n \frac{1-b_i}{1- \sigma(\boldsymbol{\alpha}_i^T \mathbf{x})} \sigma(\boldsymbol{\alpha}_i^T \mathbf{x}) (1 - \sigma(\boldsymbol{\alpha}_i^T \mathbf{x})) \,\boldsymbol{\alpha}_i\\
&= - \frac{1}{n} \sum_{i=1}^n \left(
b_i (1 - \sigma(\boldsymbol{\alpha}_i^T \mathbf{x})) - (1-b_i)\sigma(\boldsymbol{\alpha}_i^T \mathbf{x}) 
\right)\,\boldsymbol{\alpha}_i\\
&= - \frac{1}{n} \sum_{i=1}^n (
b_i - \sigma(\boldsymbol{\alpha}_i^T \mathbf{x}) 
) \,\boldsymbol{\alpha}_i.
\end{align*}

To implement this formula below, it will be useful to re-write it in terms of the matrix representation $A \in \mathbb{R}^{n \times d}$ (which has rows $\boldsymbol{\alpha}_i^T$, $i = 1,\ldots, n$) and $\mathbf{b} = (b_1, \ldots, b_n) \in \{0,1\}^n$. Let $\bsigma : \mathbb{R}^n \to \mathbb{R}^n$ be the vector-valued function that applies the sigmoid $\sigma$ entry-wise, i.e., $\bsigma(\mathbf{z}) = (\sigma(z_1),\ldots,\sigma(z_n))$ where $\mathbf{z} = (z_1,\ldots,z_n)$. Thinking of $\sum_{i=1}^n (b_i - \sigma(\boldsymbol{\alpha}_i^T \mathbf{x}))\,\boldsymbol{\alpha}_i$ as a linear combination of the columns of $A^T$ with coefficients being the entries of the vector $\mathbf{b} - \bsigma(A \mathbf{x})$, we get that 

$$
\nabla\ell(\mathbf{x}; A, \mathbf{b})
= - \frac{1}{n} \sum_{i=1}^n (
b_i - \sigma(\boldsymbol{\alpha}_i^T \mathbf{x}) 
) \,\boldsymbol{\alpha}_i
= -\frac{1}{n} A^T [\mathbf{b} - \bsigma(A \mathbf{x})].
$$
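
One way to gain confidence in the matrix form of the gradient is to compare it with finite differences of the loss on random data. The sketch below does this; the names `ce_loss` and `ce_grad`, the random data and the step `h` are all hypothetical choices made for this check.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def ce_loss(x, A, b):
    # Average cross-entropy loss
    p = sigmoid(A @ x)
    return np.mean(-b * np.log(p) - (1 - b) * np.log(1 - p))

def ce_grad(x, A, b):
    # Matrix form: -(1/n) A^T [b - sigma(A x)]
    return -A.T @ (b - sigmoid(A @ x)) / len(b)

# Random data used only for this check
rng_chk = np.random.default_rng(0)
n_chk, d_chk = 50, 3
A_chk = rng_chk.normal(size=(n_chk, d_chk))
b_chk = rng_chk.integers(2, size=n_chk)
x_chk = rng_chk.normal(size=d_chk)

# Central finite differences along each coordinate direction
h = 1e-6
fd_grad = np.array([
    (ce_loss(x_chk + h * e, A_chk, b_chk) - ce_loss(x_chk - h * e, A_chk, b_chk)) / (2 * h)
    for e in np.eye(d_chk)
])

print(np.max(np.abs(fd_grad - ce_grad(x_chk, A_chk, b_chk))))
```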

We turn to the Hessian. By symmetry, we can think of the $j$-th column of the Hessian as the gradient of the partial derivative with respect to $x_j$. Hence we start by computing the gradient of the $j$-th entry of the summands in the gradient of $\ell$. We note that, for $\boldsymbol{\alpha} = (\alpha_{1}, \ldots, \alpha_{d})  \in \mathbb{R}^d$,

$$
\nabla [(b - \sigma(\boldsymbol{\alpha}^T \mathbf{x}))\, \alpha_{j}] 
= - \nabla [\sigma(\boldsymbol{\alpha}^T \mathbf{x})] \, \alpha_{j} 
=  - \sigma(\boldsymbol{\alpha}^T \mathbf{x}) (1 - \sigma(\boldsymbol{\alpha}^T \mathbf{x}))\, \boldsymbol{\alpha}\alpha_{j}.
$$

Thus, using the fact that $\boldsymbol{\alpha} \alpha_{j}$ is the $j$-th column of the matrix $\boldsymbol{\alpha} \boldsymbol{\alpha}^T$, we get

$$
\mathbf{H}_{\ell}(\mathbf{x}; A, \mathbf{b})
= \frac{1}{n} \sum_{i=1}^n \sigma(\boldsymbol{\alpha}_i^T \mathbf{x}) (1 - \sigma(\boldsymbol{\alpha}_i^T \mathbf{x}))\, \boldsymbol{\alpha}_i \boldsymbol{\alpha}_i^T
$$

where $\mathbf{H}_{\ell}(\mathbf{x}; A, \mathbf{b})$ indicates the Hessian with respect to the $\mathbf{x}$ variables, for fixed $A, \mathbf{b}$.
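
The Hessian can also be written in matrix form as $\frac{1}{n} A^T D A$, where $D$ is the diagonal matrix with entries $\sigma(\boldsymbol{\alpha}_i^T \mathbf{x}) (1 - \sigma(\boldsymbol{\alpha}_i^T \mathbf{x}))$. The following sketch, on random data and with a hypothetical helper `ce_hess`, computes it this way and prints its eigenvalues. Since $0 \leq \sigma(z)(1 - \sigma(z)) \leq 1/4$, they should lie between $0$ and $\|A\|_F^2/(4n)$.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def ce_hess(x, A):
    # Weights sigma(1 - sigma), one per example
    s = sigmoid(A @ x)
    w = s * (1 - s)
    # sum_i w_i alpha_i alpha_i^T, written as A^T diag(w) A, divided by n
    return (A.T * w) @ A / A.shape[0]

# Random data used only for this check
rng_h = np.random.default_rng(0)
A_h = rng_h.normal(size=(50, 3))
x_h = rng_h.normal(size=3)

H = ce_hess(x_h, A_h)
eigvals = np.linalg.eigvalsh(H)  # H is symmetric
upper_bound = np.linalg.norm(A_h)**2 / (4 * A_h.shape[0])

print(eigvals)
print(upper_bound)
```

Nonnegative eigenvalues are consistent with the Hessian being positive semidefinite at the point tested.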

For step size $\beta$, one step of gradient descent is therefore

$$
\mathbf{x}^{t+1}
= \mathbf{x}^{t} +\beta \frac{1}{n} \sum_{i=1}^n (
b_i - \sigma(\boldsymbol{\alpha}_i^T \mathbf{x}^t) 
) \,\boldsymbol{\alpha}_i.
$$

**NUMERICAL CORNER:** Before implementing GD for logistic regression, we return to our proof of convergence for smooth functions using a special case. We illustrate it on a random dataset. The functions $\hat{f}$, $\ell$ and $\frac{\partial}{\partial x}\ell$ are defined next.

In [None]:
def fhat(x,a):
    return 1 / ( 1 + np.exp(-np.outer(x,a)) )

In [None]:
def loss(x,a,b): 
    return np.mean(-b*np.log(fhat(x,a)) - (1 - b)*np.log(1 - fhat(x,a)), axis=1)

In [None]:
def grad(x,a,b):
    return -np.mean((b - fhat(x,a))*a, axis=1)

In [None]:
seed = 535
rng = np.random.default_rng(seed)
n = 10000
a = 2*rng.uniform(0,1,n) - 1
b = rng.integers(2, size=n)
x = np.linspace(-1,1,100)

In [None]:
plt.plot(x, loss(x,a,b), label='loss')
plt.legend()
plt.show()

We plot next the upper and lower bounds in the *Quadratic Bound for Smooth Functions* around $x = x_0$. It turns out we can take $L=1$ because all features are uniformly random between $-1$ and $1$. Observe that minimizing the upper quadratic bound leads to a decrease in the loss $\ell$.

In [None]:
x0 = -0.3
x = np.linspace(x0-0.05,x0+0.05,100)
upper = loss(x0,a,b) + (x - x0)*grad(x0,a,b) + (1/2)*(x - x0)**2 # upper approximation
lower = loss(x0,a,b) + (x - x0)*grad(x0,a,b) - (1/2)*(x - x0)**2 # lower approximation

In [None]:
plt.plot(x, loss(x,a,b), label='loss')
plt.plot(x, upper, label='upper')
plt.plot(x, lower, label='lower')
plt.legend()
plt.show()
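
The choice $L = 1$ can be checked numerically. In this one-dimensional setting, the Hessian formula derived earlier reduces to the second derivative $\frac{1}{n} \sum_{i=1}^n \sigma(a_i x)(1 - \sigma(a_i x))\, a_i^2$, which is at most $1/4$ when $|a_i| \leq 1$. The sketch below regenerates the features with the same seed and evaluates this quantity on a grid; the helper `second_deriv` is a hypothetical name.

```python
import numpy as np

# Regenerate the random features as in the cells above
rng_s = np.random.default_rng(535)
n_s = 10000
a_s = 2 * rng_s.uniform(0, 1, n_s) - 1  # features in [-1, 1)

def second_deriv(x, a):
    # (1/n) sum_i sigma(a_i x) (1 - sigma(a_i x)) a_i^2
    s = 1 / (1 + np.exp(-x * a))
    return np.mean(s * (1 - s) * a**2)

x_s = np.linspace(-1, 1, 100)
vals = np.array([second_deriv(x, a_s) for x in x_s])

print(vals.min(), vals.max())
```

The values stay between $0$ and $1/4$, so $L = 1$ is a valid, if conservative, smoothness constant here.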

$\unlhd$

We modify our implementation of gradient descent to take a dataset as input. Recall that to run gradient descent, we first implement a function computing a descent update. It takes as input a function `grad_fn` computing the gradient itself, as well as a current iterate and a step size. We now also feed a dataset as additional input.

In [None]:
def desc_update_for_logreg(grad_fn, A, b, curr_x, beta):
    gradient = grad_fn(curr_x, A, b)
    return curr_x - beta*gradient

We are ready to implement GD. Our function takes as input a function `loss_fn` computing the objective, a function `grad_fn` computing the gradient, the dataset `A` and `b`, and an initial guess `init_x`. Optional parameters are the step size and the number of iterations.

In [None]:
def gd_for_logreg(loss_fn, grad_fn, A, b, init_x, beta=1e-3, niters=int(1e5)):
    
    # initialization
    curr_x = init_x
    
    # until the maximum number of iterations
    for _ in range(niters):
        curr_x = desc_update_for_logreg(grad_fn, A, b, curr_x, beta)
    
    return curr_x

To implement `loss_fn` and `grad_fn`, we define the sigmoid as above. Below, `pred_fn` is $\bsigma(A \mathbf{x})$. Here we write the loss function as

\begin{align*}
\ell(\mathbf{x}; A, \mathbf{b})
&= \frac{1}{n} \sum_{i=1}^n \left\{- b_i \log(\sigma(\boldsymbol{\alpha}_i^T \mathbf{x}))
- (1-b_i) \log(1- \sigma(\boldsymbol{\alpha}_i^T \mathbf{x}))\right\}\\
&= \mathrm{mean}\left(-\mathbf{b} \odot \mathbf{log}(\bsigma(A \mathbf{x})) - (\mathbf{1} - \mathbf{b}) \odot \mathbf{log}(\mathbf{1} - \bsigma(A \mathbf{x}))\right),
\end{align*}

where $\odot$ is the Hadamard product, or element-wise product (for example $\mathbf{u} \odot \mathbf{v} = (u_1 v_1, \ldots,u_n v_n)^T$), the logarithm (denoted in bold) is applied element-wise and $\mathrm{mean}(\mathbf{z})$ is the mean of the entries of $\mathbf{z}$ (i.e., $\mathrm{mean}(\mathbf{z}) = n^{-1} \sum_{i=1}^n z_i$). 

In [None]:
def pred_fn(x, A): 
    return sigmoid(A @ x)

In [None]:
def loss_fn(x, A, b): 
    return np.mean(-b*np.log(pred_fn(x, A)) - (1 - b)*np.log(1 - pred_fn(x, A)))

In [None]:
def grad_fn(x, A, b):
    return -A.T @ (b - pred_fn(x, A))/len(b)

We can choose a step size based on the smoothness of the objective as above. Recall that [`numpy.linalg.norm`](https://numpy.org/doc/stable/reference/generated/numpy.linalg.norm.html) computes the Frobenius norm by default. 

In [None]:
def stepsize_for_logreg(A, b):
    L = LA.norm(A)**2 /len(b)
    return 1/L

We start with a simple dataset from UC Berkeley's [DS100](http://www.ds100.org) course. The file `lebron.csv` is available [here](https://github.com/MMiDS-textbook/MMiDS-textbook.github.io/tree/main/utils/datasets). Quoting a previous version of the course's textbook:

> In basketball, players score by shooting a ball through a hoop. One such player, LeBron James, is widely considered one of the best basketball players ever for his incredible ability to score. LeBron plays in the National Basketball Association (NBA), the United States's premier basketball league. We've collected a dataset of all of LeBron's attempts in the 2017 NBA Playoff Games using the NBA statistics website (https://stats.nba.com/).

We first load the data and look at its summary.

In [None]:
data = pd.read_csv('lebron.csv')
data.head()

In [None]:
data.describe()

The two columns we will be interested in are `shot_distance` (LeBron's distance from the basket when the shot was attempted (ft)) and `shot_made` (0 if the shot missed, 1 if the shot went in). As the summary table above indicates, the average distance was `10.6953` and the frequency of shots made was `0.565104`. We extract those two columns and display them on a scatter plot.

In [None]:
feature = data['shot_distance']
label = data['shot_made']

In [None]:
plt.scatter(feature, label, alpha=0.2)
plt.show()

As you can see, this kind of data is hard to visualize because points with the same $x$- and $y$-values are superimposed. One trick is to jitter the $y$-values slightly by adding Gaussian noise. We do this next and plot again.

In [None]:
label_jitter = label + 0.05*rng.normal(0,1,len(label))

In [None]:
plt.scatter(feature, label_jitter, alpha=0.2)
plt.show()

We apply GD to logistic regression. We first construct the data matrices $A$ and $\mathbf{b}$. To allow an affine function of the features, we add a column of $1$'s as we have done before.  

In [None]:
A = np.stack((np.ones(len(label)),feature),axis=-1)
b = label

We run GD starting from $(0,0)$ with a step size computed from the smoothness of the objective as above.

In [None]:
stepsize = stepsize_for_logreg(A, b)
print(stepsize)

In [None]:
init_x = np.zeros(A.shape[1])
best_x = gd_for_logreg(loss_fn, grad_fn, A, b, init_x, beta=stepsize)
print(best_x)

Finally we plot the results.

In [None]:
grid = np.linspace(np.min(feature), np.max(feature), 100)
feature_grid = np.stack((np.ones(len(grid)),grid),axis=-1)
predict_grid = sigmoid(feature_grid @ best_x)

In [None]:
plt.scatter(feature, label_jitter, alpha=0.2)
plt.plot(grid,predict_grid,'r')
plt.show()

**Stochastic gradient descent** In [stochastic gradient descent](https://en.wikipedia.org/wiki/Stochastic_gradient_descent) (SGD), a variant of gradient descent, we pick a sample $I_t$ uniformly at random in $\{1,\ldots,n\}$ and update as follows

$$
\mathbf{x}^{t+1}
= \mathbf{x}^{t} +\beta \, (
b_{I_t} - \sigma(\boldsymbol{\alpha}_{I_t}^T \mathbf{x}^t) 
) \, \boldsymbol{\alpha}_{I_t}.
$$

For the mini-batch version of SGD, we pick a random sub-sample $\mathcal{B}_t \subseteq \{1,\ldots,n\}$ of size $B$ and update as follows

$$
\mathbf{x}^{t+1}
= \mathbf{x}^{t} +\beta \frac{1}{B} \sum_{i\in \mathcal{B}_t} (
b_i - \sigma(\boldsymbol{\alpha}_i^T \mathbf{x}^t) 
) \,\boldsymbol{\alpha}_i.
$$

The key observation about the two stochastic updates above is that, in expectation, they perform a step of gradient descent. That turns out to be enough, and it has computational advantages: each update uses a single sample (or $B$ samples) rather than all $n$.
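
The claim about the expectation can be checked numerically in the single-sample case. Since $I_t$ is uniform over $\{1,\ldots,n\}$, the expected update direction is the plain average of the $n$ per-sample directions. The sketch below computes that average on synthetic data and compares it with the full gradient; all names ending in `_demo`, as well as `per_sample`, `expected_grad` and `full_grad`, are made up for this illustration.

```python
import numpy as np

def sigmoid_demo(z):
    return 1 / (1 + np.exp(-z))

# small synthetic problem (illustrative values only)
rng_demo = np.random.default_rng(2)
n_demo, d_demo = 8, 3
A_demo = rng_demo.normal(size=(n_demo, d_demo))
b_demo = rng_demo.integers(2, size=n_demo)
x_demo = rng_demo.normal(size=d_demo)

# residuals b_i - sigma(alpha_i^T x), one per sample
resid = b_demo - sigmoid_demo(A_demo @ x_demo)

# full gradient of the logistic loss
full_grad = -A_demo.T @ resid / n_demo

# row i is the stochastic gradient obtained when I_t = i
per_sample = -resid[:, None] * A_demo

# expectation under a uniform choice of I_t
expected_grad = per_sample.mean(axis=0)

print(np.allclose(full_grad, expected_grad))
```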

The only modification needed to the code is to pick a random mini-batch, which can be fed to the descent update sub-routine as the dataset.

In [None]:
def sgd_for_logreg(loss_fn, grad_fn, A, b, 
                   init_x, beta=1e-3, niters=int(1e5), batch=40):
    
    # initialization
    curr_x = init_x
    
    # until the maximum number of iterations
    nsamples = len(b)
    for _ in range(niters):
        I = rng.integers(nsamples, size=batch)
        curr_x = desc_update_for_logreg(
            grad_fn, A[I,:], b[I], curr_x, beta)
    
    return curr_x
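
Note that `rng.integers` draws indices independently, so the same sample can appear more than once in a mini-batch. If a mini-batch of distinct indices is preferred, one possible variation (an assumption on our part, not what the code above does) is to sample without replacement using `Generator.choice`. The sketch below contrasts the two; the sizes and the names `I_with` and `I_without` are for illustration only.

```python
import numpy as np

rng_demo = np.random.default_rng(0)
nsamples_demo, batch_demo = 100, 40

# with replacement: repeated indices are possible
I_with = rng_demo.integers(nsamples_demo, size=batch_demo)

# without replacement: all indices are distinct
I_without = rng_demo.choice(nsamples_demo, size=batch_demo, replace=False)

# number of distinct indices in each mini-batch
print(len(np.unique(I_with)), len(np.unique(I_without)))
```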

**South African Heart Disease dataset** We analyze a dataset from [[ESL](https://web.stanford.edu/~hastie/ElemStatLearn/)], which can be downloaded [here](https://web.stanford.edu/~hastie/ElemStatLearn/data.html). Quoting [[ESL](https://web.stanford.edu/~hastie/ElemStatLearn/), Section 4.4.2]:

> The data [...] are a subset of the Coronary Risk-Factor Study (CORIS) baseline survey, carried out in three rural areas of the Western Cape, South Africa (Rousseauw et al., 1983). The aim of the study was to establish the intensity of ischemic heart disease risk factors in that high-incidence region. The data represent white males between 15 and 64, and the response variable is the presence or absence of myocardial infarction (MI) at the time of the survey (the overall prevalence of MI was 5.1% in this region). There are 160 cases in our data set, and a sample of 302 controls. These data are described in more detail in Hastie and Tibshirani (1987).

We load the data, which we slightly reformatted, and look at a summary. 

In [None]:
data = pd.read_csv('SAHeart.csv')
data.head()

Our goal is to predict `chd`, which stands for coronary heart disease, based on the other variables (which are briefly described [here](https://web.stanford.edu/~hastie/ElemStatLearn/datasets/SAheart.info.txt)). We use logistic regression again. 

We first construct the data matrices. We only use three of the predictors, as the convergence is quite slow.

In [None]:
feature = data[['tobacco', 'ldl', 'age']].to_numpy()
print(feature)

In [None]:
label = data['chd'].to_numpy()

In [None]:
A = np.concatenate((np.ones((len(label),1)),feature),axis=1)
print(A)

In [None]:
b = label

We use the same functions `loss_fn` and `grad_fn`, which were written for general logistic regression problems.

In [None]:
init_x = np.zeros(A.shape[1])

In [None]:
stepsize = stepsize_for_logreg(A, b)
print(stepsize)

In [None]:
best_x = gd_for_logreg(loss_fn, grad_fn, A, b, 
                       init_x, beta=stepsize, niters=int(1e6))

In [None]:
print(best_x)

The outcome is harder to visualize. To get a sense of how accurate the result is, we compare our predictions to the true labels. Here, we predict label $1$ whenever $\sigma(\boldsymbol{\alpha}^T \mathbf{x}) > 1/2$, and label $0$ otherwise. We try this on the training set. (A better approach would be to split the data into training and testing sets, but we will not do this here.)

In [None]:
def logis_acc(x, A, b):
    return np.sum((pred_fn(x, A) > 0.5) == b)/len(b)

In [None]:
logis_acc(best_x, A, b)
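
For reference, the train/test split mentioned above could look roughly as follows. This is only a sketch on synthetic data, not an analysis of the heart disease dataset: the 80/20 split, the step size `0.5`, the number of iterations, the coefficients `x_true` and all names ending in `_demo` are assumptions made for illustration.

```python
import numpy as np

def sigmoid_demo(z):
    return 1 / (1 + np.exp(-z))

def acc_demo(x, A, b):
    # fraction of samples whose thresholded prediction matches the label
    return np.mean((sigmoid_demo(A @ x) > 0.5) == b)

# synthetic logistic data with an intercept column (illustrative values only)
rng_demo = np.random.default_rng(0)
n_demo, d_demo = 200, 3
A_demo = np.concatenate(
    (np.ones((n_demo, 1)), rng_demo.normal(size=(n_demo, d_demo))), axis=1)
x_true = np.array([0.5, 1.0, -1.0, 2.0])
b_demo = (rng_demo.random(n_demo) < sigmoid_demo(A_demo @ x_true)).astype(int)

# random 80/20 split of the row indices
perm = rng_demo.permutation(n_demo)
n_train = int(0.8 * n_demo)
train_idx, test_idx = perm[:n_train], perm[n_train:]
A_train, b_train = A_demo[train_idx], b_demo[train_idx]
A_test, b_test = A_demo[test_idx], b_demo[test_idx]

# fit on the training rows only, with plain gradient descent
x_fit = np.zeros(A_train.shape[1])
for _ in range(2000):
    grad = -A_train.T @ (b_train - sigmoid_demo(A_train @ x_fit)) / len(b_train)
    x_fit = x_fit - 0.5 * grad

# accuracy on the rows used for fitting and on the held-out rows
train_acc = acc_demo(x_fit, A_train, b_train)
test_acc = acc_demo(x_fit, A_test, b_test)
print(train_acc, test_acc)
```

The accuracy on the held-out rows gives a more honest estimate of how the classifier performs on new data, since those rows were not used to fit the coefficients.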

We also try mini-batch stochastic gradient descent (SGD). 

In [None]:
init_x = np.zeros(A.shape[1])

In [None]:
best_x = sgd_for_logreg(loss_fn, grad_fn, A, b, 
                        init_x, beta=stepsize, niters=int(1e6))

In [None]:
print(best_x)

In [None]:
logis_acc(best_x, A, b)

### Airline customer satisfaction dataset

We return to our original motivation, the [airline customer satisfaction](https://www.kaggle.com/datasets/sjleshrac/airlines-customer-satisfaction) dataset. We first load the dataset. We will need the column names later.

In [None]:
data = pd.read_csv('customer_airline_satisfaction.csv')
data.head()

In [None]:
column_names = data.columns.tolist()
print(column_names)

Our goal will be to predict the first column, `Satisfied`, from the rest of the columns. For this, we transform our data into NumPy arrays. We also standardize the columns by subtracting their mean and dividing by their standard deviation. This will allow us to compare the influence of different features on the prediction. We also add a column of 1s to account for the intercept.

In [None]:
y = data['Satisfied'].to_numpy()
X = data.drop(columns=['Satisfied']).to_numpy()
means = np.mean(X, axis=0)
stds = np.std(X, axis=0)
X_standardized = (X - means) / stds
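
To see what the standardization does, here is a small sketch on a made-up $3 \times 2$ array (the values and the names ending in `_demo` are for illustration only). After the transformation, each column has mean $0$ and standard deviation $1$. Note that `np.std` uses the population formula (that is, `ddof=0`) by default.

```python
import numpy as np

# two columns on very different scales (made-up values)
X_demo = np.array([[1., 10.],
                   [2., 20.],
                   [3., 60.]])

# column-wise means and standard deviations
means_demo = np.mean(X_demo, axis=0)
stds_demo = np.std(X_demo, axis=0)

# broadcasting applies the column statistics to every row
X_std_demo = (X_demo - means_demo) / stds_demo

print(np.mean(X_std_demo, axis=0))
print(np.std(X_std_demo, axis=0))
```

A constant column would have standard deviation $0$ and lead to a division by zero, so this sketch assumes that no such column is present.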

In [None]:
A = np.concatenate((np.ones((len(y),1)),X_standardized),axis=1)
print(A)

In [None]:
b = y
print(b)

We use the functions `loss_fn` and `grad_fn` which were written for general logistic regression problems.

In [None]:
init_x = np.zeros(A.shape[1])

In [None]:
best_x = gd_for_logreg(loss_fn, grad_fn, A, b, 
                       init_x, beta=1e-3, niters=int(1e3))

In [None]:
print(best_x)

To interpret the results, we plot the coefficients in decreasing order. 

In [None]:
# Exclude the intercept for plotting
coefficients = best_x[1:]
features = column_names[1:]

# Sort the coefficients and corresponding feature names
sorted_indices = np.argsort(coefficients)
sorted_coefficients = coefficients[sorted_indices]
sorted_features = np.array(features)[sorted_indices]

In [None]:
# Create the horizontal bar plot
plt.figure(figsize=(10, 6))
plt.barh(sorted_features, sorted_coefficients, color='skyblue')
plt.xlabel('Coefficient Value')
plt.title('Logistic Regression Coefficients')
plt.grid(axis='x', linestyle='--', alpha=0.7)
plt.show()

We see from the first ten bars or so that, as might be expected, higher ratings on various aspects of the flight generally contribute to a higher predicted likelihood of satisfaction (with one exception being `Gate location`, whose coefficient is negative but may not be [statistically significant](https://en.wikipedia.org/wiki/Statistical_significance)). `Inflight entertainment` seems particularly influential. `Age` also shows the same pattern, something we had noticed in the introductory section through a different analysis. On the other hand, departure delay and arrival delay contribute to a lower predicted likelihood of satisfaction, again an expected pattern. The most negative influence, however, appears to come from `Class_Eco`.