# **Cost function**

---
> Cost Function : measures how far the model's predictions are from the actual values ; training tries to minimize it , ideally reaching the global minimum

Common types :

  >- Mean Absolute Error : measures how far the predictions are from the actual values on average , ignoring the direction of the error \
  MAE Equation -> j ( w , b ) = ( 1 / m ) * Sum(|yi - yi^|)

  >- Mean Squared Error : squares each error , so large errors are penalized much more than small ones \
  MSE Equation -> j ( w , b ) = ( 1 / 2m ) * Sum((yi - yi^)^2) , where the extra 1/2 is a convention that cancels the 2 produced by the derivative

  >- R^2 (Coefficient Of Determination) : tells how much of the variance in the real data is explained by the model (an evaluation metric rather than a cost to minimize) \
  Equation -> R^2 = 1 - Sum((yi - yi^)^2) / Sum((yi - average(y))^2)

  where :
  
  - i -> index of a sample , from 1 to m
  - yi -> actual target value of sample i
  - yi^ -> prediction for sample i : w * xi + b
  - m -> number of samples
---
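
As a minimal sketch, the three measures can be computed with NumPy as follows. It uses the standard definitions (MAE divided by m, the halved MSE convention of this notebook, and R^2 = 1 - SS_res / SS_tot). The function names `mae`, `half_mse`, `r2` and the sample data are illustrative assumptions, not part of the original notes.

```python
import numpy as np


def mae(y, y_hat):
    """Mean absolute error: (1 / m) * Sum(|yi - yi^|)."""
    return np.mean(np.abs(y - y_hat))


def half_mse(y, y_hat):
    """Halved mean squared error: (1 / 2m) * Sum((yi - yi^)^2)."""
    return np.sum((y - y_hat) ** 2) / (2 * y.shape[0])


def r2(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot


# Made-up data for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])   # exactly y = 2x + 1

w, b = 2.0, 0.0                      # deliberately wrong bias
y_hat = w * x + b                    # every prediction is off by 1

print(mae(y, y_hat))       # 1.0
print(half_mse(y, y_hat))  # 0.5
print(r2(y, y_hat))        # 0.8
```

A perfect fit would give MAE = 0, MSE = 0 and R^2 = 1.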

In [None]:
import numpy as np


def cost_function(x, y, w, b):
    """
    Computes the cost function for linear regression.

    Args:
      x (ndarray (m,)): Data, m examples
      y (ndarray (m,)): target values
      w,b (scalar)    : model parameters

    Returns
        total_cost (float): The cost of using w,b as the parameters for linear regression
              to fit the data points in x and y
    """
    # number of training examples
    m = x.shape[0]

    #TODO: compute the cost function result
    # y^ = wx + b
    prediction= w*x+b

    # sum of squared errors over all m examples
    total_cost = np.sum((prediction - y) ** 2)

    #END OF CODE
    # halved MSE : divide the sum by 2m
    return total_cost / (2 * m)

# **Gradient/Derivative terms**

---

How do we minimize the loss ?

> Gradient Descent : uses the derivatives of the cost function to update the weight (w) and bias (b) step by step in order to minimize the loss
>
> - start with random w , b
> - keep changing them to reduce j(w , b) (the cost function)
> - end up at (or near) the minimum

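  >w_new = w_old - alpha * d/dw j(w , b) \
  d/dw j(w , b) = (1 / m) * Sum((yi^ - yi) * xi)

  >b_new = b_old - alpha * d/db j(w , b) \
  d/db j(w , b) = (1 / m) * Sum(yi^ - yi)

  >Both derivatives are computed from the old w , b before either one is updated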

>alpha : learning rate that controls how big each step is when updating w , b during training \
If TOO BIG , the updates overshoot the minimum and may never settle on it , or even diverge \
If TOO SMALL , training is very slow and may need too many epochs to learn anything \
To choose a good learning rate , try values such as 0.001 , 0.01 , 0.1 , 1 , ... and compare how fast the cost decreases

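>If (Derivative * alpha) < 0 , we are on the left side of the minimum point , so the update moves to the right (the parameter increases) \
If (Derivative * alpha) > 0 , we are on the right side of the minimum point , so the update moves to the left (the parameter decreases) \
If (Derivative * alpha) = 0 , we are at the minimum and the parameters stop changing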
---
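
The sign rule and the effect of the learning rate can be checked on a tiny example. This is a sketch under stated assumptions: the helper names `cost`, `gradient` and `run`, the data and the three alpha values are illustrative choices, and the cost is the halved MSE used in this notebook.

```python
import numpy as np


def cost(x, y, w, b):
    """Halved MSE: (1 / 2m) * Sum((yi^ - yi)^2)."""
    return np.sum((w * x + b - y) ** 2) / (2 * x.shape[0])


def gradient(x, y, w, b):
    """Derivatives of the halved MSE with respect to w and b."""
    errors = w * x + b - y          # yi^ - yi
    return np.mean(errors * x), np.mean(errors)


def run(x, y, alpha, num_iters):
    """Start from w = b = 0, take num_iters steps, return the final cost."""
    w, b = 0.0, 0.0
    for _ in range(num_iters):
        dj_dw, dj_db = gradient(x, y, w, b)
        w, b = w - alpha * dj_dw, b - alpha * dj_db
    return cost(x, y, w, b)


# Made-up data: exactly y = 2x + 1, so the minimum is at w = 2, b = 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1

# Sign of the derivative on each side of the minimum (b fixed at 1)
left_slope, _ = gradient(x, y, 0.0, 1.0)    # w = 0 is left of w = 2
right_slope, _ = gradient(x, y, 4.0, 1.0)   # w = 4 is right of w = 2
print(left_slope, right_slope)              # negative, then positive

start_cost = cost(x, y, 0.0, 0.0)
for alpha in (0.001, 0.1, 0.5):
    print(alpha, run(x, y, alpha, num_iters=100))
```

On this data the smallest alpha barely lowers the cost in 100 steps, 0.1 gets much closer to the minimum, and 0.5 overshoots so badly that the cost grows instead of shrinking.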

In [None]:
def compute_gradient(x, y, w, b):

    """
    Computes the gradient for linear regression
    Args:
      x (ndarray (m,)): Data, m examples
      y (ndarray (m,)): target values
      w,b (scalar)    : model parameters
    Returns
      dj_dw (scalar): The gradient of the cost w.r.t. the parameters w
      dj_db (scalar): The gradient of the cost w.r.t. the parameter b
     """
    dj_dw = 0
    dj_db = 0

    # Number of training examples
    m = x.shape[0]

    #TODO: compute the gradients
    # y^ = wx + b
    prediction= w*x+b

    # error = y^ - y
    errors= prediction - y

    # derivatives of the cost with respect to w and b
    dj_dw = (1/m) * np.dot(errors, x)
    dj_db = (1/m) * np.sum(errors)

    #END OF CODE

    return dj_dw, dj_db

# **Gradient Descent**

In [None]:
def gradient_descent(x, y, w_in, b_in, alpha, num_iters):
    """
    Performs gradient descent to fit w,b. Updates w,b by taking
    num_iters gradient steps with learning rate alpha

    Args:
      x (ndarray (m,))  : Data, m examples
      y (ndarray (m,))  : target values
      w_in,b_in (scalar): initial values of model parameters
      alpha (float)     : learning rate
      num_iters (int)   : number of iterations to run gradient descent

    Returns:
      w (scalar): Updated value of parameter after running gradient descent
      b (scalar): Updated value of parameter after running gradient descent
      """

    w = w_in
    b = b_in

    #TODO: update the weights using gradient descent algorithm

    for i in range(num_iters):
        dj_dw, dj_db = compute_gradient(x, y, w, b)

        w = w - alpha * dj_dw
        b = b - alpha * dj_db

    #END OF CODE

    return w, b #return w and b

# **Polynomial Regression**

---
>Equation : y^ = w0 + w1 * x + w2 * (x ^ 2) + ... + wn * (x ^ n) \
Other transformed features can be used as well , for example a lower-degree term such as w2 * root(x) \
Used when the data can not be fitted by a straight line , so we need a curve
---
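
A short sketch of the idea: polynomial regression is still linear in the weights once the powers of x are treated as separate features. The data and the degree are made up for illustration, and a least-squares solver (`np.linalg.lstsq`) is used here instead of gradient descent only to keep the example short.

```python
import numpy as np

# Made-up data that follows a curve: y = 1 + 2x + 3x^2
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = 1 + 2 * x + 3 * x ** 2

degree = 2
# Columns are x^0, x^1, ..., x^degree, so the first weight is w0
features = np.vander(x, degree + 1, increasing=True)

# Least-squares fit of the curve
weights, *_ = np.linalg.lstsq(features, y, rcond=None)
y_hat = features @ weights
curve_error = np.mean((y_hat - y) ** 2)

# Same fit restricted to a straight line (only the x^0 and x^1 columns)
line_weights, *_ = np.linalg.lstsq(features[:, :2], y, rcond=None)
line_error = np.mean((features[:, :2] @ line_weights - y) ** 2)

print(np.round(weights, 6))      # close to [1. 2. 3.]
print(line_error, curve_error)   # the line leaves a much larger error
```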

# **Multiple Regression**


---
>Equation : y^ = w0 + w1 * x1 + w2 * x2 + ... + wn * xn \
To write it with two vectors , add a constant feature x0 = 1 so that w0 acts as the bias \
Then y^ = W(transpose) * x , or equivalently x(transpose) * W
---
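
The x0 = 1 trick can be illustrated with a small sketch. The data, the weights and the variable names (`X_aug`, `W`) are made up for illustration.

```python
import numpy as np

# Made-up data: 3 samples, 2 features (x1, x2)
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
w0 = 0.5                      # bias
w = np.array([2.0, -1.0])     # w1, w2

# Explicit form: y^ = w0 + w1 * x1 + w2 * x2
y_hat_explicit = w0 + X @ w

# Vector form: prepend x0 = 1 to every sample and fold w0 into the weights
X_aug = np.column_stack([np.ones(X.shape[0]), X])   # shape (3, 3)
W = np.concatenate([[w0], w])                       # shape (3,)
y_hat_vector = X_aug @ W

print(y_hat_explicit)   # [0.5 2.5 4.5]
print(y_hat_vector)     # same values
```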

# **Steps of Regression**

---
>1 - initialize w , b with random values & choose learning rate \
2 - use the initialized w , b to predict the output \
3 - calculate cost function\
4 - calculate the gradient \
5 - update w , b \
6 - repeat from 2 to 5 until the cost converges to the minimum or the maximum number of iterations is reached
---
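
The six steps can be sketched as one loop. This is a minimal illustration for a single feature: the function name `fit_line`, the tolerance `tol`, the default values and the stopping rule (stop when the cost changes by less than `tol`) are assumptions chosen for the example, not something the notes prescribe.

```python
import numpy as np


def fit_line(x, y, alpha=0.1, max_iters=10_000, tol=1e-12, seed=0):
    """Fit y^ = w * x + b by gradient descent, following steps 1 to 6."""
    rng = np.random.default_rng(seed)
    w, b = rng.normal(size=2)                 # 1 - random start
    m = x.shape[0]
    prev_cost = np.inf
    for it in range(max_iters):
        y_hat = w * x + b                     # 2 - predict
        errors = y_hat - y
        cost = np.sum(errors ** 2) / (2 * m)  # 3 - cost (halved MSE)
        if abs(prev_cost - cost) < tol:       # 6 - stop once the cost stalls
            break
        prev_cost = cost
        dj_dw = np.sum(errors * x) / m        # 4 - gradient
        dj_db = np.sum(errors) / m
        w -= alpha * dj_dw                    # 5 - update
        b -= alpha * dj_db
    return w, b, cost, it


# Made-up data: exactly y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1

w, b, final_cost, iters = fit_line(x, y)
print(w, b, final_cost, iters)   # w close to 2, b close to 1
```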


# **Bias & Variance**

**Variance** :
- error due to **too much complexity** in algorithm
- makes the algorithm **highly sensitive** to small variations in the training data , which can lead to **overfitting**



**Bias** :
- error due to overly **simplistic assumptions** in algorithm
- leads the model to **underfit** , making it **hard** to reach **high predictive accuracy** and to **generalize** from the training set to the test set
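
A rough way to see both kinds of error is to fit polynomials of different degrees to the same small data set and compare the error on the training points with the error on unseen points. The data, the noise level and the degrees below are made up for illustration, and `np.polyfit` is used as a convenient least-squares fitter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: a curved true relationship plus a little noise
x_train = np.linspace(-1, 1, 8)
y_train = x_train ** 2 + rng.normal(scale=0.1, size=x_train.shape)

# Unseen points, compared against the noise-free curve
x_test = np.linspace(-1, 1, 101)
y_test = x_test ** 2

train_error, test_error = {}, {}
for degree in (1, 2, 7):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_error[degree] = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_error[degree] = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(degree, train_error[degree], test_error[degree])
```

Degree 1 is too simple for curved data (high bias), so both errors stay high. Degree 7 passes through all 8 training points, so its training error is close to zero; that flexibility is what makes it sensitive to the noise (high variance), and its error on unseen points is typically worse than that of degree 2.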

# **Regularization**

- Technique used to penalize large coefficients (weights) in a regression model
- add a penalty term to the cost function
- without it , model can become too complex , fitting training data perfectly but performing poorly on new unseen data , which is overfitting

**L1 Regularization : Lasso Regression**

- Equation : Regularized Cost = Cost Function + ( (lambda / m) * Sum(|Wj|) )

- Effect :
  - Forces some weights to become exactly zero
  - Performs feature selection automatically
  - Useful when you suspect that only a few features are important

**L2 Regularization : Ridge Regression**

- Equation : Regularized Cost = Cost Function + ( (lambda / m) * Sum( (Wj) ^ 2 ) )

- Effect :
  - Shrinks weights towards zero , but usually not exactly to zero
  - Reduces model complexity
  - Smooths the fit
  - Useful when many features are relevant
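
Both penalties can be sketched on top of the halved MSE cost used earlier. The function names, the data and the value of lambda are illustrative assumptions. Leaving the bias b out of the penalty is a common convention that the notes do not state, so it is an assumption here as well.

```python
import numpy as np


def base_cost(X, y, w, b):
    """Halved MSE for y^ = X @ w + b."""
    errors = X @ w + b - y
    return np.sum(errors ** 2) / (2 * X.shape[0])


def lasso_cost(X, y, w, b, lam):
    """Cost + (lambda / m) * Sum(|wj|); the bias b is not penalized."""
    return base_cost(X, y, w, b) + (lam / X.shape[0]) * np.sum(np.abs(w))


def ridge_cost(X, y, w, b, lam):
    """Cost + (lambda / m) * Sum(wj^2); the bias b is not penalized."""
    return base_cost(X, y, w, b) + (lam / X.shape[0]) * np.sum(w ** 2)


def ridge_gradient(X, y, w, b, lam):
    """Gradient of ridge_cost; the penalty adds (2 * lambda / m) * w."""
    m = X.shape[0]
    errors = X @ w + b - y
    dj_dw = X.T @ errors / m + (2 * lam / m) * w
    dj_db = np.sum(errors) / m
    return dj_dw, dj_db


# Made-up data and weights for illustration only
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([1.0, -2.0])
b = 0.0

print(base_cost(X, y, w, b))             # 15.0
print(lasso_cost(X, y, w, b, lam=4.0))   # 18.0 = 15 + (4 / 4) * (1 + 2)
print(ridge_cost(X, y, w, b, lam=4.0))   # 20.0 = 15 + (4 / 4) * (1 + 4)
```

The extra term in `ridge_gradient` pulls every weight towards zero at each update, and the pull grows with lambda. The L1 penalty is not differentiable at zero, so lasso is usually trained with methods that handle that case explicitly.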