# Regularization
## Preventing the model from overfitting the data

<img align="Left" src="./images/C1_W3_LinearGradientRegularized.png"  style=" width:400px; padding: 10px; " >
<img align="Center" src="./images/C1_W3_LogisticGradientRegularized.png"  style=" width:400px; padding: 10px; " >

The slides above show the cost and gradient functions for both linear and logistic regression. Note:
- Cost
    - The cost functions differ significantly between linear and logistic regression, but adding regularization to the equations is the same.
- Gradient
    - The gradient functions for linear and logistic regression are very similar. They differ only in the implementation of $f_{\mathbf{w},b}$.

## Cost functions with regularization
### Cost function for regularized linear regression

The cost function for regularized linear regression is:
$$J(\mathbf{w},b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})^2  + \frac{\lambda}{2m}  \sum_{j=0}^{n-1} w_j^2 \tag{1}$$ 
where:
$$ f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = \mathbf{w} \cdot \mathbf{x}^{(i)} + b  \tag{2} $$ 


Compare this to the cost function without regularization (which you implemented in  a previous lab), which is of the form:

$$J(\mathbf{w},b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})^2 $$ 

The difference is the regularization term,  <span style="color:blue">
    $\frac{\lambda}{2m}  \sum_{j=0}^{n-1} w_j^2$ </span> 
    
Including this term encourages gradient descent to minimize the size of the parameters. Note, in this example, the parameter $b$ is not regularized. This is standard practice.

Below is an implementation of equations (1) and (2). Note that this uses a *standard pattern for this course*: a `for` loop over all `m` examples.

In [1]:
import numpy as np 
import matplotlib.pyplot as plt 

In [2]:
def compute_cost_linear_reg(X, y, w, b, lambda_=1):
    """
    Computes the regularized cost over all examples
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)): target values
      w (ndarray (n,)): model parameters  
      b (scalar)      : model parameter
      lambda_ (scalar): Controls amount of regularization
    Returns:
      total_cost (scalar):  cost 
    """
    m, n = X.shape                       # (number of examples, number of features)
    cost = 0.
    for i in range(m):
        f_wb_i = np.dot(X[i], w) + b     # (n,)(n,) = scalar, see np.dot
        cost = cost + (f_wb_i - y[i]) ** 2
    cost /= (2 * m)                      # squared-error term of equation (1)

    reg_cost = 0.
    for j in range(n):                   # sum over the weights only; b is not regularized
        reg_cost += (w[j] ** 2)
    reg_cost *= (lambda_ / (2 * m))      # regularization term of equation (1)

    total_cost = cost + reg_cost
    return total_cost


In [3]:
np.random.seed(1)
X_tmp = np.random.rand(5,6)
y_tmp = np.array([0,1,0,1,0])
print(X_tmp)
w_tmp = np.random.rand(X_tmp.shape[1])-0.5
print(w_tmp)
b_tmp = 0.5 
lambda_tmp = 0.7
cost_tmp = compute_cost_linear_reg(X_tmp,y_tmp,w_tmp,b_tmp,lambda_tmp)
print(f"Cost of the model is : {cost_tmp}")

[[4.17022005e-01 7.20324493e-01 1.14374817e-04 3.02332573e-01
  1.46755891e-01 9.23385948e-02]
 [1.86260211e-01 3.45560727e-01 3.96767474e-01 5.38816734e-01
  4.19194514e-01 6.85219500e-01]
 [2.04452250e-01 8.78117436e-01 2.73875932e-02 6.70467510e-01
  4.17304802e-01 5.58689828e-01]
 [1.40386939e-01 1.98101489e-01 8.00744569e-01 9.68261576e-01
  3.13424178e-01 6.92322616e-01]
 [8.76389152e-01 8.94606664e-01 8.50442114e-02 3.90547832e-02
  1.69830420e-01 8.78142503e-01]]
[-0.40165317 -0.07889237  0.45788953  0.03316528  0.19187711 -0.18448437]
Cost of the model is : 0.07917239320214277
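
The loops in `compute_cost_linear_reg` can also be written with NumPy array operations. The following is a minimal sketch, not part of the original lab; the name `compute_cost_linear_reg_vec` and the demo values are hypothetical. Given the same inputs, it should agree with the loop version up to floating-point rounding.

```python
import numpy as np

def compute_cost_linear_reg_vec(X, y, w, b, lambda_=1.0):
    """Vectorized sketch of equations (1) and (2); hypothetical helper name."""
    m = X.shape[0]
    err = X @ w + b - y                              # (m,) prediction errors
    cost = np.sum(err ** 2) / (2 * m)                # squared-error term
    reg_cost = (lambda_ / (2 * m)) * np.sum(w ** 2)  # b is not regularized
    return cost + reg_cost

# Small hand-checkable example (values chosen for illustration only)
X_demo = np.array([[1., 2.], [3., 4.]])
y_demo = np.array([1., 2.])
w_demo = np.array([0.5, -0.5])
print(compute_cost_linear_reg_vec(X_demo, y_demo, w_demo, 0., 1.0))  # 2.25
```

In this example the errors are -1.5 and -2.5, so the squared-error term is 8.5 / 4 = 2.125 and the regularization term is 0.5 / 4 = 0.125.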


### Cost function for regularized logistic regression
For regularized **logistic** regression, the cost function is of the form
$$J(\mathbf{w},b) = \frac{1}{m}  \sum_{i=0}^{m-1} \left[ -y^{(i)} \log\left(f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) - \left( 1 - y^{(i)}\right) \log \left( 1 - f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) \right] + \frac{\lambda}{2m}  \sum_{j=0}^{n-1} w_j^2 \tag{3}$$
where:
$$ f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = \text{sigmoid}(\mathbf{w} \cdot \mathbf{x}^{(i)} + b)  \tag{4} $$ 

Compare this to the cost function without regularization (which you implemented in  a previous lab):

$$ J(\mathbf{w},b) = \frac{1}{m}\sum_{i=0}^{m-1} \left[ -y^{(i)} \log\left(f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) - \left( 1 - y^{(i)}\right) \log \left( 1 - f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right)\right] $$

As was the case in linear regression above, the difference is the regularization term, which is    <span style="color:blue">
    $\frac{\lambda}{2m}  \sum_{j=0}^{n-1} w_j^2$ </span> 

Including this term encourages gradient descent to minimize the size of the parameters. Note, in this example, the parameter $b$ is not regularized. This is standard practice. 

In [4]:
def sigmoid(z):
    """    
      Compute the sigmoid of z
    Args:
        z (ndarray): A scalar, numpy array of any size.
    Returns:
        g (ndarray): sigmoid(z), with the same shape as z
         
    """

    return 1/(1 + np.exp(-z))

In [5]:
def compute_cost_logistic_reg(X, y, w, b, lambda_=1):
    """
    Computes the regularized cost over all examples
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)): target values
      w (ndarray (n,)): model parameters  
      b (scalar)      : model parameter
      lambda_ (scalar): Controls amount of regularization
    Returns:
      total_cost (scalar):  cost 
    """
    m, n = X.shape                       # (number of examples, number of features)
    cost = 0.

    for i in range(m):
        z_i = np.dot(X[i], w) + b        # (n,)(n,) = scalar
        f_wb_i = sigmoid(z_i)            # prediction from equation (4)
        loss_i = -y[i] * np.log(f_wb_i) - (1 - y[i]) * np.log(1 - f_wb_i)
        cost += loss_i

    cost /= m                            # log-loss term of equation (3)

    reg_cost = 0.
    for j in range(n):                   # sum over the weights only; b is not regularized
        reg_cost += (w[j] ** 2)

    reg_cost = reg_cost * (lambda_ / (2 * m))

    total_cost = cost + reg_cost
    return total_cost



In [6]:
np.random.seed(1)
X_tmp = np.random.rand(5,6)
y_tmp = np.array([0,1,0,1,0])
w_tmp = np.random.rand(X_tmp.shape[1]).reshape(-1,)-0.5
b_tmp = 0.5
lambda_tmp = 0.7
cost_tmp = compute_cost_logistic_reg(X_tmp, y_tmp, w_tmp, b_tmp, lambda_tmp)

print("Regularized cost:", cost_tmp)

Regularized cost: 0.6850849138741673
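
The direct formula takes `np.log` of a sigmoid output, which can hit `log(0)` when a score $z$ is very large in magnitude. One common workaround, sketched here as an assumption rather than as part of the original lab, uses the identity $-y\log g(z) - (1-y)\log(1-g(z)) = \log(1+e^{z}) - yz$ together with `np.logaddexp`. The function name and demo values are hypothetical.

```python
import numpy as np

def compute_cost_logistic_reg_stable(X, y, w, b, lambda_=1.0):
    """Sketch of equation (3) without an explicit log(sigmoid); hypothetical name."""
    m = X.shape[0]
    z = X @ w + b                                    # (m,) linear scores
    # -y*log(g(z)) - (1-y)*log(1-g(z)) simplifies to log(1 + e^z) - y*z
    loss = np.logaddexp(0.0, z) - y * z
    reg_cost = (lambda_ / (2 * m)) * np.sum(w ** 2)  # b is not regularized
    return np.mean(loss) + reg_cost

# With w = 0 and b = 0 every prediction is 0.5, so the cost is log(2)
X_demo = np.array([[1., 2.], [3., 4.]])
y_demo = np.array([0., 1.])
cost_zero = compute_cost_logistic_reg_stable(X_demo, y_demo, np.zeros(2), 0., 0.7)
print(cost_zero, np.log(2))

# A score of 1000 with target 0 would give log(0) in the direct formula;
# this form stays finite
cost_big = compute_cost_logistic_reg_stable(np.array([[1000.]]), np.array([0.]),
                                            np.array([1.]), 0., 0.)
print(cost_big)
```

For moderate scores the two forms should agree up to floating-point rounding; the rewritten form mainly matters when predictions saturate at 0 or 1.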


## Gradient descent with regularization
The basic algorithm for running gradient descent does not change with regularization:
$$\begin{align*}
&\text{repeat until convergence:} \; \lbrace \\
&  \; \; \;w_j = w_j -  \alpha \frac{\partial J(\mathbf{w},b)}{\partial w_j} \tag{1}  \; & \text{for j := 0..n-1} \\ 
&  \; \; \;  \; \;b = b -  \alpha \frac{\partial J(\mathbf{w},b)}{\partial b} \\
&\rbrace
\end{align*}$$
Where each iteration performs simultaneous updates on $w_j$ for all $j$.

What changes with regularization is computing the gradients.
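
Because only the gradient computation changes, the update loop can be written once and handed whichever gradient function is needed. The sketch below is illustrative, not the lab's implementation: `gradient_descent`, its `gradient_fn` argument, the helper `linear_gradient_reg`, and the toy data are all hypothetical. The helper is a vectorized form of the regularized linear-regression gradient that this section goes on to define.

```python
import numpy as np

def gradient_descent(X, y, w_init, b_init, gradient_fn, alpha, num_iters, lambda_=1.0):
    """Repeat the update rule a fixed number of times (hypothetical helper).

    gradient_fn(X, y, w, b, lambda_) is assumed to return (dj_dw, dj_db).
    """
    w = np.array(w_init, dtype=float)   # copy so the caller's array is untouched
    b = float(b_init)
    for _ in range(num_iters):
        dj_dw, dj_db = gradient_fn(X, y, w, b, lambda_)  # both use the old w, b
        w = w - alpha * dj_dw           # simultaneous update of all w_j
        b = b - alpha * dj_db
    return w, b

def linear_gradient_reg(X, y, w, b, lambda_):
    """Vectorized regularized gradient for linear regression (illustrative)."""
    m = X.shape[0]
    err = X @ w + b - y
    return X.T @ err / m + (lambda_ / m) * w, np.mean(err)

# Toy data generated from y = 2x + 1 (illustrative values only)
X_demo = np.array([[0.], [1.], [2.], [3.]])
y_demo = np.array([1., 3., 5., 7.])
w_fit, b_fit = gradient_descent(X_demo, y_demo, np.zeros(1), 0.,
                                linear_gradient_reg, alpha=0.1,
                                num_iters=5000, lambda_=0.0)
print(w_fit, b_fit)   # close to [2.] and 1 when lambda_ is 0
```

Rerunning the same call with a larger `lambda_` should give a smaller fitted weight, since the penalty pulls $w$ toward zero while leaving $b$ unpenalized.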

### Computing the Gradient with regularization (both linear/logistic)
The gradient calculations for linear and logistic regression are nearly identical, differing only in the computation of $f_{\mathbf{w},b}$.
$$\begin{align*}
\frac{\partial J(\mathbf{w},b)}{\partial w_j}  &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})x_{j}^{(i)}  +  \frac{\lambda}{m} w_j \tag{2} \\
\frac{\partial J(\mathbf{w},b)}{\partial b}  &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}) \tag{3} 
\end{align*}$$

* $m$ is the number of training examples in the data set      
* $f_{\mathbf{w},b}(\mathbf{x}^{(i)})$ is the model's prediction, while $y^{(i)}$ is the target

      
* For a  <span style="color:blue"> **linear** </span> regression model  
    $f_{\mathbf{w},b}(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b$  
* For a <span style="color:blue"> **logistic** </span> regression model  
    $z = \mathbf{w} \cdot \mathbf{x} + b$  
    $f_{\mathbf{w},b}(\mathbf{x}) = g(z)$  
    where $g(z)$ is the sigmoid function:  
    $g(z) = \frac{1}{1+e^{-z}}$   
    
The term that adds regularization is <span style="color:blue"> $\frac{\lambda}{m} w_j$ </span>.
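
One way to see why this term shrinks the weights is to substitute the gradient for $w_j$ into the update rule and group the $w_j$ terms. The rearrangement below is a sketch that assumes $\alpha > 0$ and $\lambda \ge 0$:

$$\begin{align*}
w_j &= w_j - \alpha \left[ \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})x_{j}^{(i)} + \frac{\lambda}{m} w_j \right] \\
&= \left( 1 - \alpha \frac{\lambda}{m} \right) w_j - \alpha \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})x_{j}^{(i)}
\end{align*}$$

When $\alpha \lambda / m$ is small and positive, the factor $1 - \alpha \frac{\lambda}{m}$ is slightly less than 1, so each iteration first scales $w_j$ down a little and then applies the usual unregularized step. The update for $b$ has no such factor because $b$ is not regularized.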

In [7]:
def compute_gradient_linear_reg(X, y, w, b, lambda_=1):
    """
    Computes the gradient for regularized linear regression 
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)): target values
      w (ndarray (n,)): model parameters  
      b (scalar)      : model parameter
      lambda_ (scalar): Controls amount of regularization
      
    Returns:
      dj_dw (ndarray (n,)): The gradient of the cost w.r.t. the parameters w. 
      dj_db (scalar):       The gradient of the cost w.r.t. the parameter b. 
    """
    m, n = X.shape           # (number of examples, number of features)
    dj_dw = np.zeros((n,))
    dj_db = 0.

    for i in range(m):
        f_wb_i = np.dot(X[i], w) + b     # linear model prediction
        err_i = f_wb_i - y[i]
        for j in range(n):
            dj_dw[j] += err_i * X[i, j]
        dj_db += err_i

    dj_dw /= m
    dj_db /= m

    for j in range(n):                   # add the regularization term; b is not regularized
        dj_dw[j] += (lambda_ / m) * w[j]

    return dj_dw, dj_db




In [8]:
np.random.seed(1)
X_tmp = np.random.rand(5,3)
print(X_tmp)
y_tmp = np.array([0,1,0,1,0])
w_tmp = np.random.rand(X_tmp.shape[1])
b_tmp = 0.5
lambda_tmp = 0.7
dj_dw_tmp, dj_db_tmp =  compute_gradient_linear_reg(X_tmp, y_tmp, w_tmp, b_tmp, lambda_tmp)

print(f"dj_db: {dj_db_tmp}")
print(f"Regularized dj_dw:\n {dj_dw_tmp.tolist()}")

[[4.17022005e-01 7.20324493e-01 1.14374817e-04]
 [3.02332573e-01 1.46755891e-01 9.23385948e-02]
 [1.86260211e-01 3.45560727e-01 3.96767474e-01]
 [5.38816734e-01 4.19194514e-01 6.85219500e-01]
 [2.04452250e-01 8.78117436e-01 2.73875932e-02]]
dj_db: 0.6648774569425726
Regularized dj_dw:
 [0.29653214748822276, 0.4911679625918033, 0.21645877535865857]
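
A quick way to gain confidence in an analytic gradient is to compare it against a central-difference estimate of the cost. The following is a minimal sketch under stated assumptions: the helper names, the step size `eps`, the seed, and the random demo data are all hypothetical, and the cost and gradient are written in vectorized form for brevity.

```python
import numpy as np

def cost_linear_reg(X, y, w, b, lambda_):
    """Regularized squared-error cost, vectorized (illustrative helper)."""
    m = X.shape[0]
    err = X @ w + b - y
    return np.sum(err ** 2) / (2 * m) + (lambda_ / (2 * m)) * np.sum(w ** 2)

def gradient_linear_reg(X, y, w, b, lambda_):
    """Analytic gradients from the formulas above, vectorized."""
    m = X.shape[0]
    err = X @ w + b - y
    return X.T @ err / m + (lambda_ / m) * w, np.mean(err)

def numerical_gradient(X, y, w, b, lambda_, eps=1e-6):
    """Central-difference estimate of dJ/dw_j and dJ/db."""
    dj_dw = np.zeros_like(w)
    for j in range(w.shape[0]):
        step = np.zeros_like(w)
        step[j] = eps                    # perturb one weight at a time
        dj_dw[j] = (cost_linear_reg(X, y, w + step, b, lambda_)
                    - cost_linear_reg(X, y, w - step, b, lambda_)) / (2 * eps)
    dj_db = (cost_linear_reg(X, y, w, b + eps, lambda_)
             - cost_linear_reg(X, y, w, b - eps, lambda_)) / (2 * eps)
    return dj_dw, dj_db

rng = np.random.default_rng(0)           # arbitrary seed, illustrative data
X_demo = rng.random((5, 3))
y_demo = np.array([0., 1., 0., 1., 0.])
w_demo = rng.random(3)
analytic = gradient_linear_reg(X_demo, y_demo, w_demo, 0.5, 0.7)
numeric = numerical_gradient(X_demo, y_demo, w_demo, 0.5, 0.7)
print(np.allclose(analytic[0], numeric[0], atol=1e-6),
      np.isclose(analytic[1], numeric[1], atol=1e-6))
```

If the $\frac{\lambda}{m} w_j$ term were dropped from the analytic gradient while the cost kept its penalty, the two estimates for `dj_dw` would disagree, which makes this a useful check when implementing regularization.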


In [9]:
def compute_gradient_logistic_reg(X, y, w, b, lambda_=1):
    """
    Computes the gradient for regularized logistic regression 
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)): target values
      w (ndarray (n,)): model parameters  
      b (scalar)      : model parameter
      lambda_ (scalar): Controls amount of regularization
      
    Returns:
      dj_dw (ndarray (n,)): The gradient of the cost w.r.t. the parameters w. 
      dj_db (scalar):       The gradient of the cost w.r.t. the parameter b. 
    """
    m, n = X.shape           # (number of examples, number of features)
    dj_dw = np.zeros((n,))
    dj_db = 0.

    for i in range(m):
        f_wb_i = sigmoid(np.dot(X[i], w) + b)   # logistic model prediction
        err_i = f_wb_i - y[i]
        for j in range(n):
            dj_dw[j] += err_i * X[i, j]
        dj_db += err_i

    dj_dw /= m
    dj_db /= m

    for j in range(n):                   # add the regularization term; b is not regularized
        dj_dw[j] += (lambda_ / m) * w[j]

    return dj_dw, dj_db


In [10]:
np.random.seed(1)
X_tmp = np.random.rand(5,3)
print(X_tmp)
y_tmp = np.array([0,1,0,1,0])
w_tmp = np.random.rand(X_tmp.shape[1])
b_tmp = 0.5
lambda_tmp = 0.7
dj_dw_tmp, dj_db_tmp =  compute_gradient_logistic_reg(X_tmp, y_tmp, w_tmp, b_tmp, lambda_tmp)

print(f"dj_db: {dj_db_tmp}")
print(f"Regularized dj_dw:\n {dj_dw_tmp.tolist()}")

[[4.17022005e-01 7.20324493e-01 1.14374817e-04]
 [3.02332573e-01 1.46755891e-01 9.23385948e-02]
 [1.86260211e-01 3.45560727e-01 3.96767474e-01]
 [5.38816734e-01 4.19194514e-01 6.85219500e-01]
 [2.04452250e-01 8.78117436e-01 2.73875932e-02]]
dj_db: 0.341798994972791
Regularized dj_dw:
 [0.17380012933994293, 0.32007507881566943, 0.10776313396851499]
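
Since the two gradient functions differ only in how $f_{\mathbf{w},b}$ is computed, they can be folded into one vectorized routine. This is a sketch, not the lab's code: the name `compute_gradient_reg_vec`, its `logistic` flag, and the demo values are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid of z, elementwise."""
    return 1 / (1 + np.exp(-z))

def compute_gradient_reg_vec(X, y, w, b, lambda_=1.0, logistic=False):
    """Vectorized sketch of both regularized gradients; hypothetical name and flag."""
    m = X.shape[0]
    z = X @ w + b                              # (m,) linear scores
    f_wb = sigmoid(z) if logistic else z       # the only model-specific step
    err = f_wb - y                             # (m,) prediction errors
    dj_dw = X.T @ err / m + (lambda_ / m) * w  # regularization applies to w only
    dj_db = np.mean(err)
    return dj_dw, dj_db

# Small hand-checkable example (values chosen for illustration only)
X_demo = np.array([[1., 2.], [3., 4.]])
y_demo = np.array([0., 1.])
w_demo = np.array([1., 1.])

# Linear model: scores are [3, 7], errors are [3, 6]
print(compute_gradient_reg_vec(X_demo, y_demo, w_demo, 0., lambda_=2.0))

# Logistic model with w = 0, b = 0: every prediction is 0.5
print(compute_gradient_reg_vec(X_demo, y_demo, np.zeros(2), 0., lambda_=2.0, logistic=True))
```

In the linear case the unregularized part of `dj_dw` is `[10.5, 15.0]` and the penalty adds `(2 / 2) * w = [1, 1]`, giving `[11.5, 16.0]` with `dj_db = 4.5`. In the logistic case the errors are `[0.5, -0.5]`, so `dj_dw` is `[-0.5, -0.5]` and `dj_db` is `0`.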
