# RIDGE REGRESSION COST FUNCTION

Ridge Regression, also known as Tikhonov regularization, is a type of linear regression that includes a regularization term. The purpose of this regularization term is to penalize large coefficients in the model, which can help prevent overfitting and improve the model's generalization to new data. The cost function for Ridge Regression combines the Residual Sum of Squares (RSS) with a penalty on the size of the coefficients.


The Ridge Regression cost function is given by:

$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{m} \theta_j^2$

where:
- $J(\theta)$ is the cost function to be minimized.
- $n$ is the number of observations.
- $y_i$ is the actual value for the $i^{th}$ observation.
- $\hat{y}_i$ is the predicted value for the $i^{th}$ observation, which is calculated as $\hat{y}_i = \theta^T x_i$ where $\theta$ is the coefficient vector and $x_i$ is the feature vector for the $i^{th}$ observation.
- $\lambda$ is the regularization parameter, a non-negative hyperparameter that controls the amount of shrinkage: the larger the value of $\lambda$, the more the coefficients are shrunk toward zero, which makes them more robust to collinearity.
- $\theta_j$ are the model coefficients, and $m$ is the number of features (excluding the intercept).
- The first term is the RSS divided by $2n$, which normalizes the RSS by the number of observations.
- The second term is the regularization term, where $\lambda \sum_{j=1}^{m} \theta_j^2$ adds a penalty for large coefficients.
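
The cost function above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: it assumes the first column of `X` is a column of ones, so that `theta[0]` is the intercept and is left out of the penalty. The function name `ridge_cost` and the toy data are made up for the example.

```python
import numpy as np

def ridge_cost(X, y, theta, lam):
    """Ridge cost: RSS / (2n) plus lam times the sum of squared coefficients.

    Assumption: X[:, 0] is a column of ones, so theta[0] is the intercept
    and is excluded from the penalty.
    """
    n = X.shape[0]
    residuals = y - X @ theta                       # y_i - y_hat_i
    rss_term = (residuals @ residuals) / (2 * n)    # RSS / (2n)
    penalty = lam * np.sum(theta[1:] ** 2)          # lambda * sum_j theta_j^2
    return rss_term + penalty

# Toy data: 3 observations, an intercept column plus one feature.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
theta = np.array([0.0, 1.0])  # fits the data exactly, so the RSS term is 0

cost = ridge_cost(X, y, theta, lam=0.5)
print(cost)
```

Because the fit is exact here, the whole cost comes from the penalty term, which shows how the regularization still charges for a non-zero coefficient even when the residuals vanish.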

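How to update thetas:\
$\nabla_{\theta} J(\theta) = \frac{1}{n} X^T (X\theta - y) + 2\lambda \theta$
    
Where:
- The first part is the classic gradient for linear regression, averaged over the $n$ observations.
- $2\lambda \theta$ is the gradient of the regularization term $\lambda \sum_{j=1}^{m} \theta_j^2$; it is applied to all thetas except the bias. Some libraries, such as scikit-learn, call the regularization strength alpha instead of lambda.

Full picture of updating thetas, where $\alpha$ is the learning rate (not the regularization strength):\
$\theta := \theta - \alpha \left( \frac{1}{n} X^T (X\theta - y) + 2\lambda \theta \right)$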
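
A minimal NumPy sketch of this update rule, under a few assumptions: the first column of `X` is the column of ones for the bias, so `theta[0]` is not penalized; the data gradient is averaged over the number of observations; and the penalty is taken as $\lambda \sum_j \theta_j^2$, as in the cost function above, so its gradient is $2\lambda\theta$ (with the $\frac{1}{2}\lambda$ convention it would be $\lambda\theta$). The function name, learning rate, iteration count, and toy data are illustrative only.

```python
import numpy as np

def ridge_gradient_descent(X, y, lam, learning_rate=0.1, n_iters=5000):
    """Batch gradient descent for Ridge regression.

    Assumption: X[:, 0] is a column of ones, so theta[0] is the bias
    and is excluded from the penalty.
    """
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(n_iters):
        # Gradient of the RSS / (2n) term.
        data_grad = X.T @ (X @ theta - y) / n
        # Gradient of lam * sum(theta_j^2), skipping the bias.
        penalty_grad = 2.0 * lam * theta
        penalty_grad[0] = 0.0
        theta = theta - learning_rate * (data_grad + penalty_grad)
    return theta

# Toy data: the feature is centered, y = 2 + x exactly.
X = np.array([[1.0, -1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])

theta_ols = ridge_gradient_descent(X, y, lam=0.0)
theta_ridge = ridge_gradient_descent(X, y, lam=0.5)
print(theta_ols, theta_ridge)
```

On this toy data the unregularized run recovers a slope of about 1, while $\lambda = 0.5$ shrinks the slope to about 0.4 and leaves the bias at about 2.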
    

For info, sometimes you might see a factor of 1/2 in front of the regularization term. Both formulations are mathematically valid and lead to the same set of coefficients after optimization, provided $\lambda$ is rescaled: since $\lambda \sum_{j=1}^{m} \theta_j^2 = \frac{1}{2} (2\lambda) \sum_{j=1}^{m} \theta_j^2$, the 1/2 form needs twice the value of $\lambda$ to achieve the same level of regularization. The 1/2 factor is mostly a matter of convenience: it cancels the 2 that appears when differentiating $\theta_j^2$, so the penalty gradient becomes $\lambda \theta$ instead of $2\lambda \theta$. It does not change the goal of regularization, which is to penalize large coefficients to improve model generalization.

$\lambda \sum_{j=1}^{m} \theta_j^2$\
$\frac{1}{2} \lambda \sum_{j=1}^{m} \theta_j^2$
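
As a quick numerical check of the two conventions, the sketch below evaluates both penalties and their gradients for a made-up coefficient vector. The names `lam_plain` and `lam_half` and the values used are purely illustrative.

```python
import numpy as np

theta = np.array([0.5, -1.5, 2.0])  # hypothetical non-intercept coefficients
lam_plain = 0.3                     # lambda used with the plain penalty
lam_half = 2 * lam_plain            # matching lambda for the 1/2 convention

# Penalty values under each convention.
penalty_plain = lam_plain * np.sum(theta ** 2)
penalty_half = 0.5 * lam_half * np.sum(theta ** 2)

# Penalty gradients with respect to theta under each convention.
grad_plain = 2 * lam_plain * theta
grad_half = lam_half * theta

print(np.isclose(penalty_plain, penalty_half), np.allclose(grad_plain, grad_half))
```

With `lam_half` set to twice `lam_plain`, both the penalty and its gradient agree, so an optimizer sees the same problem under either convention.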

# CLOSED FORM SOLUTION

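Ridge regression extends linear regression by adding a regularization term to the cost function, which penalizes large coefficients to prevent overfitting. Because the resulting cost function is quadratic in $\theta$, its minimizer can be computed directly, without gradient descent. The formula below minimizes the unscaled objective $\|y - X\theta\|^2 + \lambda \|\theta\|^2$; for the cost function $J(\theta)$ defined earlier, with its $\frac{1}{2n}$ factor, the same formula applies with $\lambda$ replaced by $2n\lambda$. The closed-form solution for Ridge regression, also known as the Ridge regression normal equation, is given by: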

$\hat{\theta} = (X^T X + \lambda I)^{-1} X^T y$

where:
- $\hat{\theta}$ - is the vector of estimated coefficients, i.e. the model parameters that minimize the cost function.
- $X$ - is the matrix of feature values, with each row representing an observation and each column a feature. An additional column of ones is typically added to $X$ to include the intercept term.
- $y$ - is the target/labels vector.
- $\lambda$ - is the regularization parameter, a non-negative value that controls the strength of the regularization. Larger values of $\lambda$ impose a greater penalty on the size of the coefficients.
- $I$ - is the identity matrix, with the same number of rows and columns as the number of features (including the intercept). The first diagonal element is often set to 0 to exclude the intercept from regularization.
- $X^T$ - is the transpose of $X$.
- $(X^T X + \lambda I)^{-1}$ - is the inverse of $X^T X + \lambda I$.
- $X^T y$ - is the matrix multiplication of $X^T$ and $y$.
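
The normal equation can be sketched in NumPy as shown below. This is an illustration under stated assumptions rather than a definitive implementation: `X` already contains the column of ones as its first column, the first diagonal element of the penalty matrix is zeroed to leave the intercept unregularized, and `np.linalg.solve` is swapped in for an explicit matrix inverse because solving the linear system is generally more stable numerically. The function name `ridge_closed_form` and the toy data are made up for the example.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) theta = X^T y for theta.

    Assumption: X[:, 0] is a column of ones, so the first diagonal
    entry of the penalty matrix is set to 0 to skip the intercept.
    """
    p = X.shape[1]
    penalty = lam * np.eye(p)
    penalty[0, 0] = 0.0  # do not regularize the intercept
    # Solving the system avoids forming the inverse explicitly.
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

# Toy data: the feature is centered, y = 2 + x exactly.
X = np.array([[1.0, -1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])

theta_ols = ridge_closed_form(X, y, lam=0.0)    # ordinary least squares
theta_ridge = ridge_closed_form(X, y, lam=2.0)  # slope shrunk toward zero
print(theta_ols, theta_ridge)
```

With $\lambda = 0$ the function reduces to ordinary least squares; increasing $\lambda$ shrinks the slope while the intercept is left alone.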


# You can find a practical example in the polynomial_regression.ipynb notebook