In [1]:
import numpy as np
import matplotlib.pyplot as plt

## Cost functions with regularization
### Cost function for regularized linear regression

The equation for the cost function for regularized linear regression is:
$$J(\mathbf{w},b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})^2  + \frac{\lambda}{2m}  \sum_{j=0}^{n-1} w_j^2 \tag{1}$$ 
where:
$$ f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = \mathbf{w} \cdot \mathbf{x}^{(i)} + b  \tag{2} $$ 


Compare this to the cost function without regularization (which you implemented in  a previous lab), which is of the form:

$$J(\mathbf{w},b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})^2 $$ 

The difference is the regularization term,  <span style="color:blue">
    $\frac{\lambda}{2m}  \sum_{j=0}^{n-1} w_j^2$ </span> 
    
Including this term encourages gradient descent to minimize the size of the parameters. Note, in this example, the parameter $b$ is not regularized. This is standard practice.

Below is an implementation of equations (1) and (2). It follows a *standard pattern for this course*: a `for` loop over all `m` examples.

In [2]:
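def comput_cost_function(x,y,w,b,lambda_):
    """Regularized linear regression cost, equations (1) and (2)."""
    m = len(x)                       # number of training examples
    n = len(w)                       # number of features
    f_wb = np.dot(x, w) + b          # predictions for all examples, equation (2)

    # squared-error term: loop over all m examples
    cost_sum = 0.0
    for i in range(m):
        cost_sum += (f_wb[i] - y[i]) ** 2
    cost = cost_sum / (2 * m)

    # regularization term: sum of squared weights (b is not regularized)
    reg_cost = 0.0
    for j in range(n):
        reg_cost += w[j] ** 2
    reg_cost = (lambda_ / (2 * m)) * reg_cost

    return cost + reg_cost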

In [11]:
X_tmp = np.random.rand(5,6)
y_tmp = np.array([0,1,0,1,0])
w_tmp = np.random.rand(X_tmp.shape[1]).reshape(-1,)-0.5
b_tmp = 0.5
lambda_tmp = 0.7
cost_tmp = comput_cost_function(X_tmp, y_tmp, w_tmp, b_tmp, lambda_tmp)

print("Regularized cost:", cost_tmp)

Regularized cost: 0.2158519823228361
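
The two loops in the implementation above can also be expressed with NumPy array operations. The following is a minimal sketch rather than part of the original lab; the function name `compute_cost_linear_reg_vectorized` and the small demo arrays are assumptions chosen so the result can be checked by hand.

```python
import numpy as np

def compute_cost_linear_reg_vectorized(X, y, w, b, lambda_):
    """Hypothetical vectorized form of equations (1) and (2)."""
    m = X.shape[0]
    err = X @ w + b - y                              # f_wb(x^(i)) - y^(i) for every i
    cost = np.sum(err ** 2) / (2 * m)                # squared-error term
    reg_cost = (lambda_ / (2 * m)) * np.sum(w ** 2)  # regularization term, b excluded
    return cost + reg_cost

# Illustrative data (assumed, not from the lab)
X_demo = np.array([[1.0, 2.0], [3.0, 4.0]])
y_demo = np.array([1.0, 2.0])
w_demo = np.array([0.5, -0.5])
b_demo = 0.0

# errors are [-1.5, -2.5]: 8.5 / 4 = 2.125, plus (1 / 4) * 0.5 = 0.125
cost_demo = compute_cost_linear_reg_vectorized(X_demo, y_demo, w_demo, b_demo, lambda_=1.0)
print("Regularized cost:", cost_demo)   # 2.25
```

For the same inputs, this should return the same value as the loop version up to floating-point rounding.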


### Cost function for regularized logistic regression
For regularized **logistic** regression, the cost function is of the form:
$$J(\mathbf{w},b) = \frac{1}{m}  \sum_{i=0}^{m-1} \left[ -y^{(i)} \log\left(f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) - \left( 1 - y^{(i)}\right) \log \left( 1 - f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) \right] + \frac{\lambda}{2m}  \sum_{j=0}^{n-1} w_j^2 \tag{3}$$
where:
$$ f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = \text{sigmoid}(\mathbf{w} \cdot \mathbf{x}^{(i)} + b)  \tag{4} $$ 

Compare this to the cost function without regularization (which you implemented in  a previous lab):

$$ J(\mathbf{w},b) = \frac{1}{m}\sum_{i=0}^{m-1} \left[ -y^{(i)} \log\left(f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) - \left( 1 - y^{(i)}\right) \log \left( 1 - f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right)\right] $$

As was the case in linear regression above, the difference is the regularization term, which is    <span style="color:blue">
    $\frac{\lambda}{2m}  \sum_{j=0}^{n-1} w_j^2$ </span> 

Including this term encourages gradient descent to minimize the size of the parameters. Note, in this example, the parameter $b$ is not regularized. This is standard practice. 

In [13]:
def sigmoid(z):
    """Sigmoid function g(z) = 1 / (1 + e^(-z)); z may be a scalar or an array."""
    g = 1 / (1 + np.exp(-z))
    return g

In [15]:
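def compute_cost_logistic_reg(x,y,w,b,lambda_):
    """Regularized logistic regression cost, equations (3) and (4)."""
    m = x.shape[0]                   # number of training examples
    n = len(w)                       # number of features

    # cross-entropy term: loop over all m examples
    cost = 0.0
    for i in range(m):
        z_i = np.dot(x[i], w) + b
        f_wb_i = sigmoid(z_i)        # prediction for example i, equation (4)
        cost += -y[i] * np.log(f_wb_i) - (1 - y[i]) * np.log(1 - f_wb_i)
    cost = cost / m

    # regularization term: sum of squared weights (b is not regularized)
    reg_cost = 0.0
    for j in range(n):
        reg_cost += w[j] ** 2
    reg_cost = (lambda_ / (2 * m)) * reg_cost

    return cost + reg_cost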

In [18]:
X_tmp = np.random.rand(5,6)
y_tmp = np.array([0,1,0,1,0])
w_tmp = np.random.rand(X_tmp.shape[1]).reshape(-1,)-0.5
b_tmp = 0.5
lambda_tmp = 0.7
cost_tmp = compute_cost_logistic_reg(X_tmp, y_tmp, w_tmp, b_tmp, lambda_tmp)

print("Regularized cost:", cost_tmp)

Regularized cost: 0.8200211756418744
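
The loop-based implementation above evaluates `np.log(f_wb_i)` and `np.log(1 - f_wb_i)` directly, which can return `-inf` once the sigmoid saturates to exactly 0 or 1 in floating point. One possible workaround, sketched here as an assumption and not taken from the lab, rewrites the per-example loss as $\log(1 + e^{z}) - y z$, which is algebraically equal to the bracketed term in equation (3), and evaluates it with `np.logaddexp`. The function name and demo values are hypothetical.

```python
import numpy as np

def compute_cost_logistic_reg_stable(X, y, w, b, lambda_):
    """Hypothetical vectorized form of equation (3) that never takes log(0)."""
    m = X.shape[0]
    z = X @ w + b
    # -y*log(g(z)) - (1-y)*log(1-g(z)) simplifies to log(1 + e^z) - y*z
    loss = np.logaddexp(0.0, z) - y * z
    reg_cost = (lambda_ / (2 * m)) * np.sum(w ** 2)   # b is not regularized
    return np.mean(loss) + reg_cost

# Illustrative data (assumed, not from the lab)
X_demo = np.array([[0.5, 1.5], [1.0, -1.0], [2.0, 0.5]])
y_demo = np.array([0.0, 1.0, 1.0])
w_demo = np.array([0.3, -0.2])
b_demo = 0.1
lambda_demo = 0.7

# Direct evaluation of equation (3) for comparison
f_wb = 1 / (1 + np.exp(-(X_demo @ w_demo + b_demo)))
direct = (np.mean(-y_demo * np.log(f_wb) - (1 - y_demo) * np.log(1 - f_wb))
          + (lambda_demo / (2 * len(y_demo))) * np.sum(w_demo ** 2))
stable = compute_cost_logistic_reg_stable(X_demo, y_demo, w_demo, b_demo, lambda_demo)
print(np.isclose(direct, stable))
```

On well-scaled inputs the two forms agree; they differ only when `z` is large enough in magnitude for the direct form to overflow or hit `log(0)`.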


## Gradient descent with regularization
The basic algorithm for running gradient descent does not change with regularization. It is:
$$\begin{align*}
&\text{repeat until convergence:} \; \lbrace \\
&  \; \; \;w_j = w_j -  \alpha \frac{\partial J(\mathbf{w},b)}{\partial w_j} \tag{1}  \; & \text{for j := 0..n-1} \\ 
&  \; \; \;  \; \;b = b -  \alpha \frac{\partial J(\mathbf{w},b)}{\partial b} \\
&\rbrace
\end{align*}$$
Where each iteration performs simultaneous updates on $w_j$ for all $j$.

What changes with regularization is computing the gradients.
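
The update rule above can be sketched as a short loop. This is an illustrative outline under stated assumptions, not the lab's implementation: the names `gradient_linear_reg` and `gradient_descent` are hypothetical, a fixed iteration count stands in for a convergence test, and the gradient is the vectorized regularized linear regression gradient.

```python
import numpy as np

def gradient_linear_reg(X, y, w, b, lambda_):
    """Vectorized gradient of the regularized linear regression cost."""
    m = X.shape[0]
    err = X @ w + b - y
    dj_dw = X.T @ err / m + (lambda_ / m) * w   # regularization applies to w only
    dj_db = np.mean(err)
    return dj_dw, dj_db

def gradient_descent(X, y, w_init, b_init, alpha, num_iters, lambda_):
    """Hypothetical driver loop: num_iters replaces 'repeat until convergence'."""
    w = w_init.copy()
    b = b_init
    for _ in range(num_iters):
        # both gradients are computed from the old (w, b) before either is updated
        dj_dw, dj_db = gradient_linear_reg(X, y, w, b, lambda_)
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
    return w, b

# Illustrative data (assumed): y = 2*x + 1
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([1.0, 3.0, 5.0, 7.0])

w_fit, b_fit = gradient_descent(X_demo, y_demo, np.zeros(1), 0.0,
                                alpha=0.1, num_iters=5000, lambda_=0.0)
w_reg, b_reg = gradient_descent(X_demo, y_demo, np.zeros(1), 0.0,
                                alpha=0.1, num_iters=5000, lambda_=1.0)
print("lambda = 0:", w_fit, b_fit)
print("lambda = 1:", w_reg, b_reg)
```

With `lambda_=0.0` the fit should approach `w = 2`, `b = 1`; with `lambda_=1.0` the learned weight is pulled toward zero while `b`, which is not regularized, adjusts to compensate.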

### Computing the Gradient with regularization (both linear/logistic)
The gradient calculations for linear and logistic regression are nearly identical, differing only in the computation of $f_{\mathbf{w},b}$.
$$\begin{align*}
\frac{\partial J(\mathbf{w},b)}{\partial w_j}  &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})x_{j}^{(i)}  +  \frac{\lambda}{m} w_j \tag{2} \\
\frac{\partial J(\mathbf{w},b)}{\partial b}  &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}) \tag{3} 
\end{align*}$$

* $m$ is the number of training examples in the data set
* $f_{\mathbf{w},b}(\mathbf{x}^{(i)})$ is the model's prediction, while $y^{(i)}$ is the target
* For a  <span style="color:blue"> **linear** </span> regression model  
    $f_{\mathbf{w},b}(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b$  
* For a <span style="color:blue"> **logistic** </span> regression model  
    $z = \mathbf{w} \cdot \mathbf{x} + b$  
    $f_{\mathbf{w},b}(\mathbf{x}) = g(z)$  
    where $g(z)$ is the sigmoid function:  
    $g(z) = \frac{1}{1+e^{-z}}$
    
The term that adds regularization is <span style="color:blue">$\frac{\lambda}{m} w_j$</span>.
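
The factor $\frac{\lambda}{m}$ comes from differentiating the regularization term of the cost with respect to a single weight $w_j$. As a short worked step (using $k$ as the summation index so it does not clash with $j$), only the $k = j$ term of the sum depends on $w_j$:

$$\frac{\partial}{\partial w_j}\left[\frac{\lambda}{2m}\sum_{k=0}^{n-1} w_k^2\right] = \frac{\lambda}{2m}\cdot 2 w_j = \frac{\lambda}{m} w_j$$

The regularization term does not contain $b$, so its derivative with respect to $b$ is zero. This is why the expression for $\frac{\partial J(\mathbf{w},b)}{\partial b}$ has no added term.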

In [21]:
def model_with_vectorization(x,w_init,b_init):
    """Linear model prediction f_wb = x . w + b for all examples at once."""
    f_wb = np.dot(x, w_init) + b_init
    return f_wb

In [26]:
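def compute_gradient_linear_reg(x,y,w,b,lambda_):
    """Gradient of the regularized linear regression cost, equations (2) and (3)."""
    m = x.shape[0]                            # number of training examples
    n = len(w)                                # number of features
    f_wb = model_with_vectorization(x, w, b)  # predictions for all examples

    dj_dw = np.zeros(n)
    dj_db = 0.0

    # accumulate the unregularized gradient over all m examples
    for i in range(m):
        err_i = f_wb[i] - y[i]
        dj_dw = dj_dw + err_i * x[i]
        dj_db = dj_db + err_i

    dj_dw = dj_dw / m
    dj_db = dj_db / m

    # add the regularization term (lambda/m) * w_j; b is not regularized
    for j in range(n):
        dj_dw[j] = dj_dw[j] + (lambda_ / m) * w[j]

    return dj_dw, dj_db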

In [29]:
X_tmp = np.random.rand(5,3)
y_tmp = np.array([0,1,0,1,0])
w_tmp = np.random.rand(X_tmp.shape[1])
b_tmp = 0.5
lambda_tmp = 0.7
dj_dw_tmp,dj_db_tmp =  compute_gradient_linear_reg(X_tmp, y_tmp, w_tmp, b_tmp, lambda_tmp)

print(f"dj_db: {dj_db_tmp}" )
print(f"dj_dw: {dj_dw_tmp}")

dj_db: 0.8958897321885793
dj_dw: [0.52951991 0.67782081 0.4320933 ]


In [30]:
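def compute_gradient_logistic_reg(x,y,w,b,lambda_):
    """Gradient of the regularized logistic regression cost, equations (2) and (3)."""
    m = x.shape[0]                        # number of training examples
    n = len(w)                            # number of features
    f_wb = sigmoid(np.dot(x, w) + b)      # predictions for all examples

    dj_dw = np.zeros(n)
    dj_db = 0.0

    # accumulate the unregularized gradient over all m examples
    for i in range(m):
        err_i = f_wb[i] - y[i]
        dj_dw = dj_dw + err_i * x[i]
        dj_db = dj_db + err_i

    dj_dw = dj_dw / m
    dj_db = dj_db / m

    # add the regularization term (lambda/m) * w_j; b is not regularized
    for j in range(n):
        dj_dw[j] = dj_dw[j] + (lambda_ / m) * w[j]

    return dj_dw, dj_db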

In [33]:
X_tmp = np.random.rand(5,3)
y_tmp = np.array([0,1,0,1,0])
w_tmp = np.random.rand(X_tmp.shape[1])
b_tmp = 0.5
lambda_tmp = 0.7
dj_dw_tmp, dj_db_tmp =  compute_gradient_logistic_reg(X_tmp, y_tmp, w_tmp, b_tmp, lambda_tmp)

print(f"dj_db: {dj_db_tmp}" )
print(f"dj_dw: {dj_dw_tmp}" )

dj_db: 0.4213877251839279
dj_dw: [0.28426891 0.35888777 0.34538684]
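
Because the inputs above are random, the printed gradients cannot be compared against a known answer. One common sanity check, sketched here as an assumption and not as part of the lab, compares the analytic gradient with a central-difference estimate of the regularized cost. All function names, the seed, and the step size `eps` are hypothetical choices, and the block redefines compact vectorized versions of the cost and gradient so that it runs on its own.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost_logistic_reg(X, y, w, b, lambda_):
    """Regularized logistic cost, vectorized."""
    m = X.shape[0]
    f_wb = sigmoid(X @ w + b)
    loss = -y * np.log(f_wb) - (1 - y) * np.log(1 - f_wb)
    return np.mean(loss) + (lambda_ / (2 * m)) * np.sum(w ** 2)

def gradient_logistic_reg(X, y, w, b, lambda_):
    """Analytic gradient, vectorized; only w is regularized."""
    m = X.shape[0]
    err = sigmoid(X @ w + b) - y
    return X.T @ err / m + (lambda_ / m) * w, np.mean(err)

def numerical_gradient(X, y, w, b, lambda_, eps=1e-6):
    """Central-difference estimate of dJ/dw and dJ/db."""
    dj_dw = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = eps                     # perturb one weight at a time
        dj_dw[j] = (cost_logistic_reg(X, y, w + step, b, lambda_)
                    - cost_logistic_reg(X, y, w - step, b, lambda_)) / (2 * eps)
    dj_db = (cost_logistic_reg(X, y, w, b + eps, lambda_)
             - cost_logistic_reg(X, y, w, b - eps, lambda_)) / (2 * eps)
    return dj_dw, dj_db

rng = np.random.default_rng(0)            # seed chosen arbitrarily for repeatability
X_chk = rng.random((5, 3))
y_chk = np.array([0.0, 1.0, 0.0, 1.0, 0.0])
w_chk = rng.random(3)
b_chk, lambda_chk = 0.5, 0.7

dw_a, db_a = gradient_logistic_reg(X_chk, y_chk, w_chk, b_chk, lambda_chk)
dw_n, db_n = numerical_gradient(X_chk, y_chk, w_chk, b_chk, lambda_chk)
print(np.allclose(dw_a, dw_n, atol=1e-6), np.isclose(db_a, db_n, atol=1e-6))
```

If the two estimates disagree by much more than the tolerance, the usual suspects are a missing $\frac{\lambda}{m} w_j$ term or regularization applied to $b$ by mistake.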
