# Regularized Logistic Regression 

Like linear regression, logistic regression can be regularized. The regularization term is added to the cost function and plays the same role as before: it penalizes large weights, which helps reduce overfitting to the training data.



Let's refresh our memory about the cost function of logistic regression: 

$$ \displaystyle J(\vec{w}, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \left( f_{\vec{w}, b}(\vec{x}^{(i)}) \right) + (1 - y^{(i)}) \log \left( 1 - f_{\vec{w}, b}(\vec{x}^{(i)}) \right) \right] $$

where $f_{\vec{w}, b}(\vec{x}) = \displaystyle \frac{1}{1 + e^{-z}} $ is the sigmoid function applied to $z$, and $z$ is a function (linear or polynomial) of the input features $\vec{x}$, the weights $\vec{w}$, and the bias $b$.

Adding the regularization term to this cost gives:

$$ \displaystyle J(\vec{w}, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \left( f_{\vec{w}, b}(\vec{x}^{(i)}) \right) + (1 - y^{(i)}) \log \left( 1 - f_{\vec{w}, b}(\vec{x}^{(i)}) \right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2 $$

where $\lambda$ is the regularization parameter and $n$ is the number of features. Only the weights $w_j$ are penalized; the bias $b$ is not.

As we did for linear regression, we can use gradient descent to find the optimal values of $\vec{w}$ and $b$. 

The partial derivatives of the cost function with respect to $\vec{w}$ and $b$ are: 

$$ \displaystyle \frac{\partial J}{\partial \vec{w}} = \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w}, b}(\vec{x}^{(i)}) - y^{(i)} \right) \vec{x}^{(i)} + \frac{\lambda}{m} \vec{w} $$ 

$$ \displaystyle \frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w}, b}(\vec{x}^{(i)}) - y^{(i)} \right) $$
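
The extra $\frac{\lambda}{m} \vec{w}$ term in the first gradient comes directly from differentiating the penalty. As a short worked step for a single weight $w_j$ (the summation index is renamed to $k$ here only to avoid a clash with $j$):

$$ \displaystyle \frac{\partial}{\partial w_j} \left( \frac{\lambda}{2m} \sum_{k=1}^{n} w_k^2 \right) = \frac{\lambda}{2m} \cdot 2 w_j = \frac{\lambda}{m} w_j $$

The penalty does not depend on $b$, which is why $\displaystyle \frac{\partial J}{\partial b}$ has the same form as in the unregularized case.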

Once again, **gradient descent** is implemented in code by simultaneously updating the values of $\vec{w}$ and $b$, as follows: 

$$ \vec{w} = \vec{w} - \alpha \frac{\partial J}{\partial \vec{w}} = \vec{w} - \alpha \left( \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w}, b}(\vec{x}^{(i)}) - y^{(i)} \right) \vec{x}^{(i)} + \frac{\lambda}{m} \vec{w} \right) $$ 

$$ b = b - \alpha \frac{\partial J}{\partial b} = b - \alpha \left( \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w}, b}(\vec{x}^{(i)}) - y^{(i)} \right) \right) $$

where $\alpha$ is the learning rate. 

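Bear in mind that these are variable assignments, not equations: both right-hand sides are evaluated with the current values of $\vec{w}$ and $b$ before either one is overwritten.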

Let's implement this in code.

In [1]:
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 
plt.style.use('seaborn-v0_8')
%config InlineBackend.figure_format = 'svg'

In [22]:
# auxiliary sigmoid function: computes z = fun(X, w, b) and returns g(z)

def sigmoid(fun, X, w, b):
    z = fun(X, w, b)
    return 1 / (1 + np.exp(-z))

In [29]:
# implementing the updated cost function

def J_logistic_reg(X, y, w, b, fun, lambda_=1): 
    """
    Computes the regularized cost over all examples
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)) : target values
      w (ndarray (n,)) : model parameters
      b (scalar)       : model parameter
      fun (function)   : function computing z from (X, w, b); its output is passed through the sigmoid, i.e. g(f(x))
      lambda_ (scalar) : Controls amount of regularization
    Returns:
      cost (scalar):  cost 
    """

    m, n = X.shape 
    cost = 0. 
    loss = 0. 
    reg_cost = 0. 

    # average cross-entropy loss over the m examples
    for i in range(m):
        y_hat_i = sigmoid(fun, X[i], w, b)
        loss += -y[i] * np.log(y_hat_i) - (1 - y[i]) * np.log(1 - y_hat_i)
    loss /= m 

    # L2 penalty on the weights only (the bias b is not regularized)
    for j in range(n): 
        reg_cost += (w[j]**2)
    reg_cost *= (lambda_ / (2*m))

    cost = loss + reg_cost
    return cost  

Let's test this with some random data.

In [27]:
def linear_reg(X, w, b): 
    return np.dot(X, w) + b

In [30]:
np.random.seed(1)
X_tmp = np.random.rand(5,6)
y_tmp = np.array([0,1,0,1,0])
w_tmp = np.random.rand(X_tmp.shape[1]).reshape(-1,)-0.5
b_tmp = 0.5
lambda_tmp = 0.7
cost_tmp = J_logistic_reg(X_tmp, y_tmp, w_tmp, b_tmp, fun=linear_reg, lambda_=lambda_tmp)

print("Regularized cost:", cost_tmp)

Regularized cost: 0.6850849138741673
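
The explicit loops follow the formula term by term, which is easy to read but slow for large $m$. As a cross-check, the same cost could be written in vectorized form. The following is a minimal sketch that assumes a linear $z = X\vec{w} + b$; the function name `regularized_logistic_cost` and the small demo arrays are illustrative and not part of the code above.

```python
import numpy as np

def regularized_logistic_cost(X, y, w, b, lambda_=1.0):
    """Vectorized regularized logistic cost, assuming a linear z = X @ w + b."""
    m = X.shape[0]
    z = X @ w + b                       # shape (m,)
    y_hat = 1.0 / (1.0 + np.exp(-z))    # sigmoid applied element-wise
    # average cross-entropy loss over the m examples
    loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    # L2 penalty on the weights only; the bias b is not regularized
    reg = (lambda_ / (2 * m)) * np.sum(w ** 2)
    return loss + reg

# Hand-checkable case: with w = 0 and b = 0 every prediction is 0.5,
# the penalty vanishes, and the cost reduces to log(2).
X_demo = np.array([[1.0, 2.0], [3.0, 4.0]])
y_demo = np.array([0, 1])
cost_demo = regularized_logistic_cost(X_demo, y_demo, np.zeros(2), 0.0, lambda_=0.7)
print(cost_demo, np.log(2))
```

Choosing all-zero parameters makes the result easy to verify by hand, since the cost then depends only on $\log(0.5)$.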


Let's also implement the gradient computation that the gradient descent algorithm relies on.

In [70]:
def compute_gradient(X, y, w, b, lambda_, type_, fun_=linear_reg):
    """
    Computes the regularized gradient for linear or logistic regression
    Args:
      X (ndarray (m,n)) : Data, m examples with n features
      y (ndarray (m,))  : target values
      w (ndarray (n,))  : model parameters
      b (scalar)        : model parameter
      lambda_ (scalar)  : Controls amount of regularization
      type_   (string)  : Either 'linreg' or 'logreg'
      fun_    (function): function computing z from (X, w, b); only used when type_ is 'logreg'
    Returns
      dj_db (scalar)            : The gradient of the cost w.r.t. the parameter b.
      dj_dw (ndarray Shape (n,)): The gradient of the cost w.r.t. the parameters w.
    """

    m, n = X.shape
    dj_dw = np.zeros((n,))
    dj_db = 0.0

    if type_ not in ['logreg', 'linreg']:
        raise ValueError(f'Incorrect value for "type_": {type_}. Please pick either "linreg" or "logreg".')

    def y_hatfun(X, w, b, TYPE):
        # model prediction: linear output for 'linreg', sigmoid of fun_ for 'logreg'
        if TYPE == 'linreg':
            return linear_reg(X, w, b)
        elif TYPE == 'logreg':
            return sigmoid(fun_, X, w, b)

    # accumulate the unregularized gradient over all examples
    for i in range(m):
        y_hat_i = y_hatfun(X[i], w, b, TYPE=type_)
        err_i = y_hat_i - y[i]
        for j in range(n):
            dj_dw[j] += err_i * X[i, j]
        dj_db += err_i
    dj_dw /= m
    dj_db /= m

    # add the regularization term for the weights only (b is not regularized)
    for j in range(n):
        dj_dw[j] += (lambda_ / m) * w[j]

    return dj_db, dj_dw

Let's test this with some random data.

In [74]:
np.random.seed(1)
X_tmp = np.random.rand(5,3)
y_tmp = np.array([0,1,0,1,0])
w_tmp = np.random.rand(X_tmp.shape[1])
b_tmp = 0.5
lambda_tmp = 0.7
dj_db_tmp, dj_dw_tmp = compute_gradient(X_tmp, y_tmp, w_tmp, b_tmp, lambda_=lambda_tmp, type_='logreg')

print(f"dj_db: {dj_db_tmp}")
print(f"Regularized dj_dw:\n {dj_dw_tmp.tolist()}")

dj_db: 0.341798994972791
Regularized dj_dw:
 [0.17380012933994293, 0.32007507881566943, 0.10776313396851497]
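
With the gradient available, the update rule for $\vec{w}$ and $b$ can be assembled into a full training loop. The following is a minimal, self-contained sketch under some assumptions: it uses a linear $z$ and a vectorized gradient rather than the loop-based `compute_gradient`, and the names `cost_fn` and `gradient_descent`, the learning rate `alpha=0.1`, and the iteration count are illustrative choices rather than tuned values.

```python
import numpy as np

def cost_fn(X, y, w, b, lambda_):
    """Regularized logistic cost for a linear z = X @ w + b."""
    m = X.shape[0]
    y_hat = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return loss + (lambda_ / (2 * m)) * np.sum(w ** 2)

def gradient_descent(X, y, w_init, b_init, alpha, lambda_, num_iters):
    """Runs num_iters simultaneous updates of w and b."""
    w = np.asarray(w_init, dtype=float).copy()
    b = float(b_init)
    m = X.shape[0]
    for _ in range(num_iters):
        # both gradients are computed from the current (w, b) ...
        err = 1.0 / (1.0 + np.exp(-(X @ w + b))) - y
        dj_dw = X.T @ err / m + (lambda_ / m) * w
        dj_db = np.mean(err)
        # ... and only then are the parameters overwritten
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
    return w, b

# small random problem, similar in shape to the tests above
rng = np.random.default_rng(1)
X_gd = rng.random((5, 3))
y_gd = np.array([0, 1, 0, 1, 0])
w0, b0 = np.zeros(3), 0.0

w_fit, b_fit = gradient_descent(X_gd, y_gd, w0, b0, alpha=0.1, lambda_=0.7, num_iters=1000)
cost_before = cost_fn(X_gd, y_gd, w0, b0, 0.7)
cost_after = cost_fn(X_gd, y_gd, w_fit, b_fit, 0.7)
print(f"cost before: {cost_before:.4f}, cost after: {cost_after:.4f}")
```

Comparing the cost before and after training is a quick sanity check: with a sufficiently small learning rate, the regularized cost should go down.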
