# Non-Regularized Logistic Regression

In this notebook we will use a logistic model, built on the logistic function (also called the sigmoid) $\sigma:\mathbb{R}\rightarrow(0,1)$, $\sigma(z)=\frac{1}{1+\exp(-z)}$, to solve a classification problem. To obtain a variety of models, we also consider a function $f(\boldsymbol{x})=\langle \boldsymbol{x}, \boldsymbol{\theta} \rangle$ and define our hypothesis as $h_{\theta}(\boldsymbol{x})=\sigma(f(\boldsymbol{x}))$.

The value of the hypothesis at a particular feature vector $\boldsymbol{x}$, which is a number between 0 and 1, may be thought of as the "confidence" the model has that $\boldsymbol{x}$ is a member of the positive class. We can therefore choose a threshold (0.5, for instance) such that, if the confidence is greater than or equal to the threshold, we predict $\boldsymbol{x}$ to be a member of the positive class:
$$y_{\mathrm{predicted}}(\boldsymbol{x}) = \left\{\begin{split}
& 1 \ \mathrm{if} \ h_{\theta}(\boldsymbol{x})\geq 0.5 \\
& 0 \ \mathrm{if} \ h_{\theta}(\boldsymbol{x})<0.5 
\end{split}\right..$$
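
As a minimal sketch of this thresholding rule, the snippet below applies it to a few made-up feature vectors. The names `sigmoid_demo`, `theta_demo` and `X_demo`, and the numbers they hold, are illustrative assumptions and are not taken from the dataset used later.

```python
import numpy as np

def sigmoid_demo(z):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters and feature vectors (first component is the constant 1)
theta_demo = np.array([-1.0, 2.0])
X_demo = np.array([[1.0, 0.0],
                   [1.0, 0.5],
                   [1.0, 2.0]])

# Confidence h_theta(x) for each row of X_demo
confidence = sigmoid_demo(X_demo @ theta_demo)

# Predict the positive class where the confidence reaches the 0.5 threshold
y_demo_pred = (confidence >= 0.5).astype(int)
print(confidence, y_demo_pred)
```

Since $\sigma(z)\geq 0.5$ exactly when $z\geq 0$, thresholding the confidence at 0.5 is the same as checking the sign of $f(\boldsymbol{x})$.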

## Cost function

For a logistic regression, the cost function $J(\boldsymbol{\theta})$ will not be defined as the usual mean square difference but rather as
$$J(\boldsymbol{\theta})=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\ln\left(h_{\theta}\left(\boldsymbol{x}^{(i)}\right)\right) + \left(1-y^{(i)}\right)\ln\left(1-h_{\theta}\left(\boldsymbol{x}^{(i)}\right)\right)\right].$$
If we define a vector $\boldsymbol{h}_{\theta}$ such that each of its components corresponds to the value $h_{\theta}\left(\boldsymbol{x}^{(i)}\right)$ for each training example, we can write the previous cost function in vector form as:
$$J(\boldsymbol{\theta})=-\frac{\langle \boldsymbol{y}, \ln(\boldsymbol{h}_{\theta})\rangle + \langle\boldsymbol{1}-\boldsymbol{y}, \ln\left(\boldsymbol{1}-\boldsymbol{h}_{\theta}\right)\rangle}{m},$$
or equivalently
$$J(\boldsymbol{\theta})=-\frac{\boldsymbol{y}^{T}\ln(\boldsymbol{h}_{\theta}) + \left(\boldsymbol{1}-\boldsymbol{y}\right)^{T}\ln\left(\boldsymbol{1}-\boldsymbol{h}_{\theta}\right)}{m},$$
where $\boldsymbol{1}$ is a vector whose components are all equal to 1.
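
The equivalence between the sum form and the vector form can be checked numerically. This is only a sketch on made-up values: `h_demo` and `y_demo` are hypothetical hypothesis outputs and labels chosen for illustration.

```python
import numpy as np

# Hypothetical hypothesis values and labels for m = 4 training examples
h_demo = np.array([0.9, 0.2, 0.6, 0.3])
y_demo = np.array([1.0, 0.0, 1.0, 0.0])
m_demo = y_demo.size

# Sum form: one term per training example
J_sum = -sum(
    y_demo[i] * np.log(h_demo[i]) + (1 - y_demo[i]) * np.log(1 - h_demo[i])
    for i in range(m_demo)
) / m_demo

# Vector form: inner products with the element-wise logarithms
J_vec = -(y_demo @ np.log(h_demo) + (1 - y_demo) @ np.log(1 - h_demo)) / m_demo

print(J_sum, J_vec)
```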

We shall also use matrices to write the vector-valued hypothesis function $\boldsymbol{h}_{\theta}:\mathbb{R}^{m\times n}\rightarrow\mathbb{R}^{m}$. Using the matrix $X_{ij}=x^{(i)}_{j}$, whose rows are the training examples, we can first write $f$ as a vector-valued function
$\boldsymbol{f}:\mathbb{R}^{m\times n}\rightarrow \mathbb{R}^{m}$ such that each component of $\boldsymbol{f}$ corresponds to the value of $f(\boldsymbol{x}^{(i)})$ for each training example as follows
$$\boldsymbol{f}(X)=X\boldsymbol{\theta}.$$
Moreover, we can define a vector-valued sigmoid function $\boldsymbol{\sigma}:\mathbb{R}^{m}\rightarrow\mathbb{R}^{m}$ such that the $i$-th component of $\boldsymbol{\sigma}(\boldsymbol{z})$ is $\sigma(z_{i})$, with $\sigma$ as previously defined. This can be done using element-wise division as
$$\boldsymbol{\sigma}(\boldsymbol{z})=\frac{1}{\boldsymbol{1}+\exp(-\boldsymbol{z})}.$$

Using both of these new definitions we can now define $\boldsymbol{h}_{\theta}$ as follows
$$\boldsymbol{h}_{\theta}(X) = \frac{1}{\boldsymbol{1}+\exp(-X\boldsymbol{\theta})},$$
where the division refers to the element-wise reciprocal of $\boldsymbol{1}+\exp(-X\boldsymbol{\theta})$.

## Cost function gradient
For the mathematical background we only need one more thing: the gradient of the cost, or equivalently the partial derivative of $J$ with respect to each parameter $\theta_{j}$. I won't provide a proof, but these partial derivatives are exactly equal to
$$\frac{\partial J}{\partial \theta_{j}}=\frac{1}{m}\sum_{i=1}^{m}x_{j}^{(i)}\left(h_{\theta}\left(\boldsymbol{x}^{(i)}\right)-y^{(i)}\right),$$
which we may store in a vector $\nabla J$, called the gradient, using the previously defined vector-valued function as
$$\nabla J = \frac{X^{T}\left(\boldsymbol{h}_{\theta}-\boldsymbol{y}\right)}{m},$$
which will be required to optimize the parameters $\boldsymbol{\theta}$.
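
The gradient formula follows from the identity $\sigma'(z)=\sigma(z)\left(1-\sigma(z)\right)$. In place of a proof, it can be sanity-checked against central finite differences. The sketch below does this on random synthetic data; the `_demo` and `_chk` names and the data are assumptions made for this check only and are separate from the functions defined later in the notebook.

```python
import numpy as np

def sigmoid_demo(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_demo(theta, X, y):
    # Vector form of the logistic cost
    h = sigmoid_demo(X @ theta)
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / y.size

def grad_demo(theta, X, y):
    # Analytic gradient X^T (h - y) / m
    h = sigmoid_demo(X @ theta)
    return X.T @ (h - y) / y.size

# Synthetic data: 20 examples, a column of ones plus 2 random features
rng = np.random.default_rng(0)
X_chk = np.c_[np.ones(20), rng.normal(size=(20, 2))]
y_chk = rng.integers(0, 2, size=20).astype(float)
theta_chk = rng.normal(size=3)

# Central finite differences, one parameter at a time
eps = 1e-6
numeric = np.zeros_like(theta_chk)
for j in range(theta_chk.size):
    step = np.zeros_like(theta_chk)
    step[j] = eps
    numeric[j] = (cost_demo(theta_chk + step, X_chk, y_chk)
                  - cost_demo(theta_chk - step, X_chk, y_chk)) / (2 * eps)

max_abs_diff = np.max(np.abs(numeric - grad_demo(theta_chk, X_chk, y_chk)))
print(f"Largest difference between numeric and analytic gradient: {max_abs_diff:.2e}")
```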

In [1]:
# Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import minimize

%matplotlib widget

## Modules

In [2]:
def plotData(df):
    fig, ax = plt.subplots()
    ax.scatter(df.x1[df.y==1], df.x2[df.y==1], label="Positive")
    ax.scatter(df.x1[df.y==0], df.x2[df.y==0], marker="x", label="Negative")
    ax.legend()
    return ax

In [3]:
def sigmoid(z):
    return 1./ (1 + np.exp(-z))

In [4]:
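def cost(theta, X, y):
    """
    Returns the (unregularized) logistic regression cost
    """
    m = y.size
    h = sigmoid(X @ theta)
    J = -(np.transpose(y) @ np.log(h) + np.transpose(1-y) @ np.log(1-h))/m
    return J

def grad(theta, X, y):
    """
    Returns the gradient of the (unregularized) cost
    """
    m = y.size
    h = sigmoid(X @ theta)
    # Local name dJ avoids shadowing the function name grad
    dJ = np.transpose(X) @ (h-y)/m
    return dJ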

In [5]:
def predict(theta, X):
    # Confidence of the positive class for each example
    h = sigmoid(X @ theta)
    # Predict 1 where the confidence reaches the 0.5 threshold, 0 otherwise
    y_preds = (h >= 0.5).astype(float)
    return y_preds

## Data

In [6]:
df = pd.read_csv("ex2data1.txt", sep=",")

# Features
X = df.drop("y", axis=1)
X = np.array(X)

# Targets
y = df["y"]

# Number of training examples (m) and number of features (n)
m, n = X.shape

In [7]:
ax = plotData(df)
ax.set(title = "X space", xlabel = "Exam 1 score", ylabel = "Exam 2 score")
plt.show();


In [8]:
X = np.c_[np.ones((m,1)), X]
theta0 = np.zeros(n+1)

In [9]:
J = cost(theta0, X, y)
dJ = grad(theta0, X, y)

print(f"Cost at initial theta:\n \t\t\t{J:.3f}")
print("Grad at initial theta:")

for i in range(dJ.size):
    print(f"\t\t\t{dJ[i]:.2f}")

Cost at initial theta:
 			0.693
Grad at initial theta:
			-0.10
			-12.01
			-11.26


## Minimization of Cost using BFGS

In [10]:
import warnings
warnings.filterwarnings('ignore')

In [11]:
res = minimize(lambda t: cost(t, X, y),
               theta0,
               method='BFGS',
               jac=lambda t: grad(t, X, y),
               options={'disp': False})

theta = res.x

print("Optimal parameters:", end = " ")
print(theta)

print(f"Cost with optimal parameters: {cost(theta, X, y):.3f}")

Optimal parameters: [-25.16133284   0.2062317    0.2014716 ]
Cost with optimal parameters: 0.203


In [12]:
# Plot boundary
f = lambda t: (-theta[1]*t-theta[0])/theta[2]
t = np.array([29, 100])
ax.plot(t, f(t), c='#000000', label='Boundary')
ax.legend();
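
The line plotted above follows from the threshold rule: the boundary is the set of points where $h_{\theta}(\boldsymbol{x})=0.5$, that is, where $f(\boldsymbol{x})=0$. With features $x_{1}$ (exam 1 score) and $x_{2}$ (exam 2 score) and an intercept $\theta_{0}$, and assuming $\theta_{2}\neq 0$, this gives
$$\theta_{0}+\theta_{1}x_{1}+\theta_{2}x_{2}=0 \quad\Longrightarrow\quad x_{2}=-\frac{\theta_{1}x_{1}+\theta_{0}}{\theta_{2}},$$
which is the expression evaluated by the lambda `f` in the cell above.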

## Model accuracy (on training set)

In [13]:
y_preds = predict(theta, X)

# Fraction of training examples whose predicted label matches the target
# (computed without overwriting y_preds)
accuracy = np.mean(y_preds == np.asarray(y))

print(f"Train Accuracy: {accuracy * 100}")

Train Accuracy: 89.0


# Regularized Logistic Regression

To get more interesting boundaries, we may consider using more complex functions $f(\boldsymbol{x})=\langle \boldsymbol{x}, \boldsymbol{\theta} \rangle$. For instance, we may obtain a polynomial in a single feature $x$ by using the vector of its powers:
$$\boldsymbol{x} = \begin{bmatrix}
1 \\
x \\
x^{2} \\
\vdots \\
x^{N}
\end{bmatrix}.
$$

If we want to use 2 features $x_{1}, x_{2}$ we could consider a vector containing combinations of their powers:
$$\boldsymbol{x} = \begin{bmatrix}
1 \\
x_{1} \\
x_{2} \\
x_{1}^{2} \\
x_{1}x_{2} \\
x_{2}^{2} \\
x_{1}^{3} \\
\vdots \\
x_{1}^{i-k}x_{2}^{k}\\
\end{bmatrix}.
$$
However, by doing so we might run into the problem of overfitting: the boundary fits the training data very closely but generalizes poorly to data outside the training set.
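
To make the two-feature vector concrete, the sketch below enumerates the exponent pairs $(i-k, k)$ of every monomial $x_{1}^{i-k}x_{2}^{k}$ up to a given total degree, in the same order as the vector above. The helper `monomial_exponents` is a hypothetical name used only for illustration; it is not the feature-mapping function used in this notebook.

```python
def monomial_exponents(max_degree):
    """Return (power of x1, power of x2) for every monomial up to max_degree."""
    return [(i - k, k)
            for i in range(max_degree + 1)   # total degree i = 0, ..., max_degree
            for k in range(i + 1)]           # power of x2, k = 0, ..., i

exponents = monomial_exponents(2)
print(exponents)
# [(0, 0), (1, 0), (0, 1), (2, 0), (1, 1), (0, 2)]
```

A full polynomial of total degree $N$ in two features has $(N+1)(N+2)/2$ terms, so the number of parameters grows quickly with $N$.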

To use the advantages of more complex functions without overfitting we use regularization, which consists of penalizing the parameters $\theta_{j}$ (excluding the independent term $\theta_{0}$) so that they cannot be "big" in magnitude. In order to do so, we modify the cost function such that
$$J\left(\boldsymbol{\theta}\right)= -\frac{\boldsymbol{y}^{T}\ln(\boldsymbol{h}_{\theta}) + \left(\boldsymbol{1}-\boldsymbol{y}\right)^{T}\ln\left(\boldsymbol{1}-\boldsymbol{h}_{\theta}\right)}{m} + \frac{\lambda}{2m}\sum_{j=1}^{N}\theta_{j}^{2},$$
or equivalently
$$J\left(\boldsymbol{\theta}\right)= -\frac{\boldsymbol{y}^{T}\ln(\boldsymbol{h}_{\theta}) + \left(\boldsymbol{1}-\boldsymbol{y}\right)^{T}\ln\left(\boldsymbol{1}-\boldsymbol{h}_{\theta}\right) - \frac{\lambda}{2}\sum_{j=1}^{N}\theta_{j}^{2}}{m},$$
where $\lambda$ is a regularization parameter that allows us to change how much we want to penalize the coefficients $\theta_{j}$.
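
Differentiating this cost term by term, and reusing the unregularized gradient stated earlier, gives the following partial derivatives (a short derivation sketch):
$$\frac{\partial J}{\partial \theta_{0}}=\frac{1}{m}\sum_{i=1}^{m}x_{0}^{(i)}\left(h_{\theta}\left(\boldsymbol{x}^{(i)}\right)-y^{(i)}\right),\qquad
\frac{\partial J}{\partial \theta_{j}}=\frac{1}{m}\sum_{i=1}^{m}x_{j}^{(i)}\left(h_{\theta}\left(\boldsymbol{x}^{(i)}\right)-y^{(i)}\right)+\frac{\lambda}{m}\theta_{j}\quad (j\geq 1).$$
In vector form this reads
$$\nabla J = \frac{X^{T}\left(\boldsymbol{h}_{\theta}-\boldsymbol{y}\right)+\lambda\tilde{\boldsymbol{\theta}}}{m},$$
where $\tilde{\boldsymbol{\theta}}$ (a symbol introduced here only for illustration) denotes $\boldsymbol{\theta}$ with its first component $\theta_{0}$ replaced by 0, since the independent term is not penalized.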

## Modules

In [14]:
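def mapFeatures(X1, X2, degree=6):
    """
    Creates and returns a matrix with additional polynomial features.

    The first column is all ones; the remaining columns are the products
    X1**(i-j) * X2**j for i = 1, ..., degree-1 and j = 0, ..., i-1.
    With degree=6 this gives 16 columns. These loop ranges leave out the
    pure powers of X2 and the terms of total degree `degree`; iterating over
    range(1, degree + 1) and range(i + 1) would include them.
    """
    Xout = np.ones(X1.size)

    for i in range(degree):
        for j in range(i):
            # Append the column X1^(i-j) * X2^j
            Xout = np.c_[Xout, (X1**(i-j))*(X2**j)]

    return Xout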

In [16]:
def costReg(theta, X, y, l):
    """
    Returns the regularized cost
    """
    m = y.size
    h = sigmoid(X @ theta)
    # Copy theta so the caller's array is not modified in place;
    # the independent term theta[0] is excluded from the penalty
    tempTheta = theta.copy()
    tempTheta[0] = 0

    J = - (np.transpose(y) @ np.log(h) + np.transpose(1-y) @ np.log(1-h) - l/2  * np.transpose(tempTheta) @ tempTheta)/m
    return J

def gradReg(theta, X, y, l):
    """
    Returns the regularized cost gradient
    """
    m = y.size
    h = sigmoid(X @ theta)
    # Copy theta so the caller's array is not modified in place
    tempTheta = theta.copy()
    tempTheta[0] = 0

    dJ = (np.transpose(X) @ (h-y) + l*tempTheta)/m
    return dJ

## Data

In [17]:
df = pd.read_csv("ex2data2.txt", sep=",")

# Features
X = df.drop("y", axis=1)
X = np.array(X)

# Targets
y = df["y"]

# Sample size
m = X[:,0].size

In [18]:
ax = plotData(df)
ax.set(title = "X space", xlabel = "Microchip Test 1", ylabel = "Microchip Test 2")
plt.show();


In [19]:
X = mapFeatures(X[:,0], X[:,1])

In [20]:
initial_theta = np.zeros(X.shape[1])
l = 1

print(f"Cost with initial parameters: {costReg(initial_theta, X, y, l):.3f}")
print("Gradient at initial theta:")
# Named grad0 so that the grad() function defined earlier is not overwritten
grad0 = gradReg(initial_theta, X, y, l)
for i in range(grad0.size):
    print(f"{grad0[i]:.4f}")

Cost with initial parameters: 0.693
Gradient at initial theta:
0.0085
0.0188
0.0503
0.0115
0.0184
0.0073
0.0082
0.0393
0.0022
0.0129
0.0031
0.0200
0.0043
0.0034
0.0058
0.0045


In [21]:
res = minimize(lambda t: costReg(t, X, y, l),
               initial_theta,
               method='BFGS',
               jac=lambda t: gradReg(t, X, y, l),
               options={'disp': False})

theta = res.x
print(f"Cost with optimal parameters: {costReg(theta, X, y, l):.3f}")

Cost with optimal parameters: 0.621
