# Overfitting

Overfitting happens when a model fits the training data too closely (its training cost is near zero) and, as a result, predicts poorly on new data.

- What does that mean?
  - If the model fits the training data almost perfectly, it has usually fit the noise in that data as well, so it fails on unseen inputs, and predicting on unseen inputs is the main purpose of the model.
- Solutions?
  - More training data:
    - If the model overfits the given data, give it more training data, so that it follows the overall trend (a low cost, but not necessarily zero) rather than every individual point.
  - Feature selection:
    - If the model uses many features, keep only the most relevant ones; fewer features leave the model less room to overfit.
  - Regularization:
    - Linear Regression with Regularization.
    - Logistic Regression with Regularization.
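
To make the idea concrete, here is a minimal sketch of overfitting on synthetic data. The data set, the noise level, and the two polynomial degrees are illustrative assumptions, not part of the notes above. A degree-9 polynomial has enough freedom to pass through all 10 training points, while a degree-2 polynomial can only follow the overall trend.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a quadratic trend plus noise
x_train = np.linspace(-1, 1, 10)
y_train = x_train**2 + rng.normal(scale=0.1, size=x_train.shape)
x_test = np.linspace(-0.95, 0.95, 50)
y_test = x_test**2 + rng.normal(scale=0.1, size=x_test.shape)

def mse(coeffs, x, y):
    """Mean squared error of a polynomial with the given coefficients."""
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

simple = np.polyfit(x_train, y_train, deg=2)    # follows the trend
flexible = np.polyfit(x_train, y_train, deg=9)  # can pass through every training point

train_mse_simple = mse(simple, x_train, y_train)
train_mse_flexible = mse(flexible, x_train, y_train)
test_mse_simple = mse(simple, x_test, y_test)
test_mse_flexible = mse(flexible, x_test, y_test)

print(f"degree 2: train {train_mse_simple:.4f}, test {test_mse_simple:.4f}")
print(f"degree 9: train {train_mse_flexible:.4f}, test {test_mse_flexible:.4f}")
```

The flexible fit always wins on the training points, because it contains the simple fit as a special case. On the held-out points it will usually do worse, which is the signature of overfitting.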

# **Regularization**

Regularization penalizes large parameter values so that the model is less likely to overfit.


## **Linear Regression with Regularization**

The cost function for linear regression is defined by:
$$J(\mathbf{w},b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})^2 $$
where:
$$ f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = \mathbf{w} \cdot \mathbf{x}^{(i)} + b  $$

Regularization adds an extra penalty term to the cost function,
$\frac{\lambda}{2m}  \sum_{j=0}^{n-1} w_j^2$,
so it becomes:
$$J(\mathbf{w},b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})^2  + \frac{\lambda}{2m}  \sum_{j=0}^{n-1} w_j^2 $$

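When gradient descent is used to find the parameters $\mathbf{w}, b$, this extra term pushes the weights $w_j$ toward smaller values; the bias $b$ is not regularized. The regularization parameter $\lambda$ controls how strong this effect is.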

Let's code it


In [2]:
import numpy as np
import matplotlib.pyplot as plt

In [3]:
def compute_cost_linear_reg(X, y, w, b, lambda_=1):
    # lambda_ is the regularization strength (defaults to 1)
    m = len(y)  # number of training examples
    n = len(w)  # number of features
    cost = 0.
    
    # Compute the cost of a normal linear regression
    for i in range(m):
        cost += (np.dot(w, X[i]) + b - y[i])**2
    cost /= (2*m)
    
    # Compute the regularization term: sum over the n weights (b is not regularized)
    reg = 0.
    for j in range(n):
        reg += w[j]**2
    reg *= lambda_ / (2*m)
    
    J = reg + cost
    return J
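
The same cost can be written without explicit loops. The sketch below is one possible vectorized version; the function name `regularized_linear_cost` and the tiny data set are hypothetical and chosen so the result can be checked by hand.

```python
import numpy as np

def regularized_linear_cost(X, y, w, b, lambda_=1.0):
    """Regularized squared-error cost; X is (m, n), y is (m,), w is (n,)."""
    m = X.shape[0]
    err = X @ w + b - y                       # prediction errors, shape (m,)
    cost = np.sum(err ** 2) / (2 * m)         # unregularized cost
    reg = lambda_ * np.sum(w ** 2) / (2 * m)  # penalty on the weights only
    return cost + reg

# Hypothetical data: 2 examples, 2 features
X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, 2.0])
w = np.array([0.5, 0.25])

cost_plain = regularized_linear_cost(X, y, w, b=0.0, lambda_=0.0)
cost_reg = regularized_linear_cost(X, y, w, b=0.0, lambda_=1.0)
print(cost_plain, cost_reg)
```

By hand: the predictions are 1.0 and 2.5, so the errors are 0 and 0.5 and the unregularized cost is 0.25 / 4 = 0.0625. With $\lambda = 1$ the penalty adds (0.25 + 0.0625) / 4 = 0.078125, for a total of 0.140625.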

### **Gradient Descent for Linear Regression with Regularization.**

**Gradient descent is used to find the parameters that minimize the model cost.**

Gradient Descent for Linear Regression:
$$\begin{align*}
\; \lbrace \\
&  \; \; \;w_j = w_j -  \alpha \frac{\partial J(\mathbf{w},b)}{\partial w_j}   \; & \text{for j := 0..n-1} \\ 
&  \; \; \;  \; \;b = b -  \alpha \frac{\partial J(\mathbf{w},b)}{\partial b} \\
&\rbrace
\end{align*}$$

where 
$$\begin{align*}
\frac{\partial J(\mathbf{w},b)}{\partial w_j}  &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})x_{j}^{(i)}  +  \frac{\lambda}{m} w_j  \\
\frac{\partial J(\mathbf{w},b)}{\partial b}  &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})  
\end{align*}$$
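
Substituting the regularized gradient into the update for $w_j$ makes the effect of the penalty visible. This is only a rearrangement of the two formulas above:

$$\begin{align*}
w_j &= w_j - \alpha \left[ \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})x_{j}^{(i)} + \frac{\lambda}{m} w_j \right] \\
&= w_j \left( 1 - \alpha \frac{\lambda}{m} \right) - \alpha \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})x_{j}^{(i)}
\end{align*}$$

Each step first multiplies $w_j$ by $\left(1 - \alpha \frac{\lambda}{m}\right)$ and then applies the usual unregularized update. For example, with $\alpha = 0.01$, $\lambda = 1$ and $m = 50$ (values picked only for illustration), the factor is $1 - 0.0002 = 0.9998$, so every iteration shrinks the weight slightly.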


In [None]:
def compute_gradient_descent_linear_reg(x,y, w, b, alpha, lambda_):
    # Computes the gradient only; alpha is kept in the signature but is not used here
    m,n = x.shape
    
    dw = np.zeros((n,))
    db = 0.
    
    for i in range(m):
        # err = w.x + b - y
        err = np.dot(w, x[i]) + b - y[i]
        for j in range(n):
            # accumulate err * x_j for each feature j
            dw[j] += err * x[i,j]
        db += err
    dw /= m
    db /= m
    # add the regularization term (weights only; b is not regularized)
    for j in range(n):
        dw[j] += lambda_ * w[j] / m
    return dw, db
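
A gradient on its own does not train anything; it has to be plugged into the update rule. The sketch below shows one way to do that. The helper names `gradient` and `fit`, the learning rate, the iteration count, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def gradient(X, y, w, b, lambda_):
    """Gradient of the regularized squared-error cost with respect to w and b."""
    m = X.shape[0]
    err = X @ w + b - y
    dw = X.T @ err / m + lambda_ * w / m  # penalty applies to the weights only
    db = np.sum(err) / m
    return dw, db

def fit(X, y, alpha=0.1, lambda_=0.0, num_iters=1000):
    """Run batch gradient descent starting from w = 0, b = 0."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(num_iters):
        dw, db = gradient(X, y, w, b, lambda_)
        w = w - alpha * dw
        b = b - alpha * db
    return w, b

# Hypothetical synthetic data: 50 examples, 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=50)

w_plain, b_plain = fit(X, y, lambda_=0.0)
w_reg, b_reg = fit(X, y, lambda_=10.0)
print("lambda = 0 :", w_plain, b_plain)
print("lambda = 10:", w_reg, b_reg)
```

Comparing the two runs shows the weights learned with $\lambda = 10$ are smaller in norm than those learned with $\lambda = 0$, which is the shrinking effect the penalty is meant to produce.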

## **Logistic Regression with Regularization**

The cost function for logistic regression is defined by:
$$ J(\mathbf{w},b) = \frac{1}{m}\sum_{i=0}^{m-1} \left[ -y^{(i)} \log\left(f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) - \left( 1 - y^{(i)}\right) \log \left( 1 - f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right)\right] $$

where:
$$ f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = \text{sigmoid}(\mathbf{w} \cdot \mathbf{x}^{(i)} + b)$$


Regularization adds the same extra penalty term to the cost function,
$\frac{\lambda}{2m}  \sum_{j=0}^{n-1} w_j^2$,
so it becomes:

$$J(\mathbf{w},b) = \frac{1}{m}  \sum_{i=0}^{m-1} \left[ -y^{(i)} \log\left(f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) - \left( 1 - y^{(i)}\right) \log \left( 1 - f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) \right] + \frac{\lambda}{2m}  \sum_{j=0}^{n-1} w_j^2 $$

When gradient descent is used to find the parameters $\mathbf{w}, b$, this extra term pushes the weights $w_j$ toward smaller values; the bias $b$ is not regularized.

Let's code it


In [4]:
def cost_logistic(X, y, w, b, lambda_):
    m = len(y)  # number of training examples
    n = len(w)  # number of features
    cost = 0.
    
    # Compute the cost of a normal logistic regression
    for i in range(m):
        f_wb = 1 / (1 + np.exp(-np.dot(w, X[i]) - b))  # sigmoid(w.x + b)
        cost += y[i] * np.log(f_wb) + (1 - y[i]) * np.log(1 - f_wb)
    cost /= -m
    
    # Compute the regularization term: sum over the n weights (b is not regularized)
    reg = 0.
    for j in range(n):
        reg += w[j]**2
    reg *= lambda_ / (2*m)
    
    J = reg + cost
    return J
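
A quick sanity check for this cost: with all parameters at zero, the model predicts 0.5 for every example, so the cost should equal $\log 2 \approx 0.693$ and the penalty should be zero. The vectorized function below is a sketch written for this check; its name `regularized_logistic_cost` and the four-example data set are hypothetical.

```python
import numpy as np

def regularized_logistic_cost(X, y, w, b, lambda_=1.0):
    """Regularized cross-entropy cost; X is (m, n), y is (m,) with 0/1 labels."""
    m = X.shape[0]
    f = 1.0 / (1.0 + np.exp(-(X @ w + b)))                    # sigmoid(w.x + b)
    cost = -np.mean(y * np.log(f) + (1 - y) * np.log(1 - f))  # unregularized cost
    reg = lambda_ * np.sum(w ** 2) / (2 * m)                  # penalty on the weights only
    return cost + reg

# Hypothetical data: 4 examples, 2 features
X = np.array([[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5]])
y = np.array([0, 0, 1, 1])

cost_at_zero = regularized_logistic_cost(X, y, np.zeros(2), 0.0)
print(cost_at_zero, np.log(2))

# The penalty alone: lambda * (1^2 + 1^2) / (2 * 4) = 0.5 for lambda = 2
w_ones = np.array([1.0, 1.0])
penalty = (regularized_logistic_cost(X, y, w_ones, 0.0, lambda_=2.0)
           - regularized_logistic_cost(X, y, w_ones, 0.0, lambda_=0.0))
print(penalty)
```

Note that this sketch does not guard against `log(0)`, which can occur when predictions saturate at exactly 0 or 1 for very large weights.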

### **Gradient Descent for Logistic Regression with Regularization.**

**Gradient descent is used to find the parameters that minimize the model cost.**

Gradient Descent for Logistic Regression:
$$\begin{align*}
\; \lbrace \\
&  \; \; \;w_j = w_j -  \alpha \frac{\partial J(\mathbf{w},b)}{\partial w_j}   \; & \text{for j := 0..n-1} \\ 
&  \; \; \;  \; \;b = b -  \alpha \frac{\partial J(\mathbf{w},b)}{\partial b} \\
&\rbrace
\end{align*}$$

where 
$$\begin{align*}
\frac{\partial J(\mathbf{w},b)}{\partial w_j}  &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})x_{j}^{(i)}  +  \frac{\lambda}{m} w_j  \\
\frac{\partial J(\mathbf{w},b)}{\partial b}  &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})  
\end{align*}$$
**Remember the sigmoid function**

$z = \mathbf{w} \cdot \mathbf{x} + b$  
$f_{\mathbf{w},b}(\mathbf{x}) = g(z)$  
$g(z) = \frac{1}{1+e^{-z}}$  


In [6]:
def compute_gradient_logistic(x, y, w, b, lambda_):
    m,n = x.shape
    
    dw = np.zeros((n,))
    db = 0.
    
    for i in range(m):
        # err = f_wb - y
        # f_wb = sigmoid(w.x + b)
        err = sigmoid(np.dot(w, x[i]) + b) - y[i]
        for j in range(n):
            # accumulate err * x_j for each feature j
            dw[j] += err * x[i,j]
        db += err
    dw /= m
    db /= m
    # add the regularization term (weights only; b is not regularized)
    for j in range(n):
        dw[j] += lambda_ * w[j] / m
    return dw, db

def sigmoid(z):
    return 1 / (1 + np.exp(-z))
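
One way to gain confidence in a hand-derived gradient is to compare it with a finite-difference approximation of the cost. The sketch below does this for regularized logistic regression. All function names, the random data, and the step size `eps` are assumptions chosen for illustration; the block redefines everything it needs so it can run on its own.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(X, y, w, b, lambda_):
    """Regularized cross-entropy cost."""
    m = X.shape[0]
    f = sigmoid(X @ w + b)
    loss = -np.mean(y * np.log(f) + (1 - y) * np.log(1 - f))
    return loss + lambda_ * np.sum(w ** 2) / (2 * m)

def analytic_gradient(X, y, w, b, lambda_):
    """Gradient from the formulas above, in vectorized form."""
    m = X.shape[0]
    err = sigmoid(X @ w + b) - y
    dw = X.T @ err / m + lambda_ * w / m
    db = np.sum(err) / m
    return dw, db

def numerical_gradient(X, y, w, b, lambda_, eps=1e-6):
    """Central finite differences of the cost, one parameter at a time."""
    dw = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = eps
        dw[j] = (cost(X, y, w + step, b, lambda_)
                 - cost(X, y, w - step, b, lambda_)) / (2 * eps)
    db = (cost(X, y, w, b + eps, lambda_)
          - cost(X, y, w, b - eps, lambda_)) / (2 * eps)
    return dw, db

# Hypothetical random problem: 20 examples, 3 features
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
y = (rng.random(20) > 0.5).astype(float)
w = rng.normal(size=3)
b = 0.3

dw_a, db_a = analytic_gradient(X, y, w, b, lambda_=0.7)
dw_n, db_n = numerical_gradient(X, y, w, b, lambda_=0.7)
print(np.max(np.abs(dw_a - dw_n)), abs(db_a - db_n))
```

If the two gradients disagree by more than a small tolerance, the usual suspects are a missing $\frac{\lambda}{m} w_j$ term or a penalty that was applied to $b$ by mistake.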

**We have now covered the overfitting problem and how to reduce it using regularization.**