# RIDGE REGRESSION COST FUNCTION

Ridge Regression, also known as Tikhonov regularization, is a type of linear regression that includes a regularization term. The purpose of this regularization term is to penalize large coefficients in the model, which can help prevent overfitting and improve the model's generalization to new data. The cost function for Ridge Regression combines the Residual Sum of Squares (RSS) with a penalty on the size of the coefficients.

The Ridge Regression cost function is given by:

$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{m} \theta_j^2$

where:
- $J(\theta)$ is the cost function to be minimized.
- $n$ is the number of observations.
- $y_i$ is the actual value for the $i$-th observation.
- $\hat{y}_i$ is the predicted value for the $i$-th observation, calculated as $\hat{y}_i = \theta^T x_i$, where $\theta$ is the coefficient vector and $x_i$ is the feature vector for the $i$-th observation (with a leading 1 when the model has an intercept $\theta_0$).
- $\lambda$ is the regularization parameter, a non-negative hyperparameter that controls the amount of shrinkage: the larger the value of $\lambda$, the more the coefficients are shrunk toward zero, which also makes them more robust to collinearity among the features.
- $\theta_j$ are the model coefficients, and $m$ is the number of features (excluding the intercept).
- The first term is the RSS divided by $2n$, which normalizes the RSS by the number of observations.
- The second term is the regularization term, where $\lambda \sum_{j=1}^{m} \theta_j^2$ adds a penalty for large coefficients.
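
The formula above can be sketched as a small NumPy function. This is only an illustration: the name `ridge_cost`, its argument layout (intercept passed separately from the penalized coefficients), and the toy data are assumptions, not something defined elsewhere in these notes.

```python
import numpy as np

def ridge_cost(X, y, theta0, theta, lam):
    """Ridge cost: RSS / (2n) + lam * sum(theta_j^2).

    theta0 is the intercept and is left out of the penalty,
    matching the sum over j = 1..m in the formula.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    theta = np.asarray(theta, dtype=float)
    n = y.shape[0]
    y_hat = theta0 + X @ theta            # predictions for all observations
    rss = np.sum((y - y_hat) ** 2)        # residual sum of squares
    penalty = lam * np.sum(theta ** 2)    # L2 penalty on the coefficients
    return rss / (2 * n) + penalty

# Toy data invented for illustration: y = 1 + 2x exactly.
X_demo = [[0.0], [1.0], [2.0]]
y_demo = [1.0, 3.0, 5.0]

# The fit is perfect, so RSS is 0 and only the penalty 0.5 * 2^2 remains.
cost_perfect_fit = ridge_cost(X_demo, y_demo, theta0=1.0, theta=[2.0], lam=0.5)
print(cost_perfect_fit)
```

Even a model with zero prediction error has a non-zero cost here, because the penalty depends only on the size of the coefficients.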

### Practical Example with Ridge Regression Cost Function

Let's consider a simple example with a dataset containing only 2 observations and 1 feature to illustrate the computation of the Ridge Regression cost function.

Suppose we have the following dataset:

| Observation | Feature $x_1$ | Actual Value $y$ |
|-------------|---------------|------------------|
| 1           | 1             | 2                |
| 2           | 2             | 3                |

And we have a simple Ridge Regression model with the following parameters:

- $\theta_0 = 0.5$ (intercept)
- $\theta_1 = 1$ (coefficient for $x_1$)
- $\lambda = 0.1$

Our model predicts $\hat{y}$ as:

$\hat{y} = \theta_0 + \theta_1 x_1$

Let's compute the cost function $J(\theta)$ for this model and dataset.

### Step 1: Compute Predicted Values
Calculate the predicted values $\hat{y}$ for each observation.

For Observation 1: $\hat{y}_1 = 0.5 + 1 \times 1 = 1.5$  
For Observation 2: $\hat{y}_2 = 0.5 + 1 \times 2 = 2.5$

### Step 2: Compute RSS
Calculate the Residual Sum of Squares (RSS).

RSS = $(2 - 1.5)^2 + (3 - 2.5)^2 = 0.25 + 0.25 = 0.5$

### Step 3: Compute Regularization Term
Calculate the regularization term.

Regularization Term = $\lambda \sum_{j=1}^{m} \theta_j^2 = 0.1 \times 1^2 = 0.1$

### Step 4: Compute Cost Function
Combine the RSS and the regularization term to get the cost function $J(\theta)$.

$J(\theta) = \frac{1}{2 \times 2} \times 0.5 + 0.1 = 0.125 + 0.1 = 0.225$

### Interpretation
The cost function value of 0.225 represents the cost associated with our Ridge Regression model given the dataset and the chosen $\lambda$. It includes both the error in prediction (RSS) and the penalty for the size of the coefficient. Minimizing this cost function during training helps in finding the coefficients that not only fit the training data well but are also kept relatively small to avoid overfitting.
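
The four steps of this example can be checked numerically. A minimal sketch with NumPy, using the dataset and parameters from the example; the variable names are illustrative.

```python
import numpy as np

# Dataset and parameters from the one-feature example
x1 = np.array([1.0, 2.0])
y = np.array([2.0, 3.0])
theta0, theta1, lam = 0.5, 1.0, 0.1

y_hat = theta0 + theta1 * x1            # Step 1: predicted values
rss = np.sum((y - y_hat) ** 2)          # Step 2: residual sum of squares
penalty = lam * theta1 ** 2             # Step 3: penalty (intercept excluded)
cost = rss / (2 * len(y)) + penalty     # Step 4: RSS / (2n) + penalty

print("predictions:", y_hat)
print("RSS:", rss)
print("penalty:", penalty)
print("cost:", round(cost, 6))
```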

Next, let's compute the Ridge Regression cost function $J(\theta)$ for a dataset with two features, $x_1$ and $x_2$, and two observations. We'll use the same formula:

$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{m} \theta_j^2$

### Given Dataset:
Assume our dataset consists of the following:

| $x_1$ (Feature 1) | $x_2$ (Feature 2) | $y$ (Actual Value) |
|-------------------|-------------------|--------------------|
| 1                 | 3                 | 2                  |
| 2                 | 4                 | 3                  |

### Model Parameters:
Let's assume the Ridge Regression model parameters are:
- Intercept $\theta_0 = 0.5$
- Coefficient for $x_1$, $\theta_1 = 1$
- Coefficient for $x_2$, $\theta_2 = 0.5$
- Regularization parameter $\lambda = 0.1$

### Steps to Compute $J(\theta)$:

#### Step 1: Compute Predicted Values $\hat{y}_i$
$\hat{y}_i = \theta_0 + \theta_1 x_{i1} + \theta_2 x_{i2}$ (In case of linear regression) \
For Observation 1: $\hat{y}_1 = 0.5 + 1 \times 1 + 0.5 \times 3 = 3$  
For Observation 2: $\hat{y}_2 = 0.5 + 1 \times 2 + 0.5 \times 4 = 4.5$

#### Step 2: Compute Sum of Squared Residuals
$\sum_{i=1}^{2} (y_i - \hat{y}_i)^2 = (2 - 3)^2 + (3 - 4.5)^2 = 1 + 2.25 = 3.25$

#### Step 3: Compute Regularization Term
$\lambda \sum_{j=1}^{2} \theta_j^2 = 0.1 \times (1^2 + 0.5^2) = 0.1 \times 1.25 = 0.125$

#### Step 4: Compute $J(\theta)$
$J(\theta) = \frac{1}{2 \times 2} \times 3.25 + 0.125 = 0.8125 + 0.125 = 0.9375$

The Ridge Regression cost function for this dataset, with two features and two observations, is therefore $J(\theta) = 0.9375$. This value combines the model's prediction error (the scaled sum of squared residuals, $0.8125$) with the penalty on the size of the coefficients (the regularization term, $0.125$). During training, the regularization parameter $\lambda = 0.1$ controls how strongly large coefficients are penalized, which is what helps reduce overfitting.
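
The same computation can be written in vectorized form, which scales to any number of features. A minimal NumPy sketch using the dataset and parameters of this example; the variable names are illustrative.

```python
import numpy as np

# Two observations (rows), two features (columns)
X = np.array([[1.0, 3.0],
              [2.0, 4.0]])
y = np.array([2.0, 3.0])

theta0 = 0.5                      # intercept (not penalized)
theta = np.array([1.0, 0.5])      # coefficients for x1 and x2
lam = 0.1

y_hat = theta0 + X @ theta                  # predicted values
rss = np.sum((y - y_hat) ** 2)              # sum of squared residuals
penalty = lam * np.sum(theta ** 2)          # regularization term
cost = rss / (2 * len(y)) + penalty         # J(theta)

print("predictions:", y_hat)
print("RSS:", rss, "penalty:", penalty, "cost:", round(cost, 6))
```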

Note that the penalty is sometimes written with an extra factor of $\frac{1}{2}$, as in the second form below. Both formulations are mathematically valid and lead to the same coefficients after optimization, provided $\lambda$ is rescaled: the form with the $\frac{1}{2}$ factor needs twice the $\lambda$ of the form without it to apply the same amount of regularization. The $\frac{1}{2}$ factor is mostly a convenience, since it cancels the 2 produced when differentiating $\theta_j^2$; it does not change the goal of regularization, which is to penalize large coefficients to improve model generalization.

$\lambda \sum_{j=1}^{m} \theta_j^2$ \
$\frac{1}{2} \lambda \sum_{j=1}^{m} \theta_j^2$
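
The equivalence of the two penalty forms can be illustrated numerically. The sketch below makes a simplifying assumption that is not part of the examples above: a one-feature model with no intercept, for which the minimizer of the $\frac{1}{2n}$-scaled cost has a simple closed form. The function names are hypothetical.

```python
import numpy as np

def penalty_plain(theta, lam):
    """Penalty written as lam * sum(theta_j^2)."""
    return lam * np.sum(theta ** 2)

def penalty_half(theta, lam):
    """Penalty written as (1/2) * lam * sum(theta_j^2)."""
    return 0.5 * lam * np.sum(theta ** 2)

def argmin_1d(x, y, lam, half):
    """Minimizer of RSS/(2n) + penalty for y ~ theta * x (no intercept).

    Setting the derivative to zero gives
    theta = sum(x*y) / (sum(x^2) + k * n * lam),
    with k = 2 for the plain penalty and k = 1 for the 1/2 form.
    """
    n = len(y)
    k = 1.0 if half else 2.0
    return np.sum(x * y) / (np.sum(x ** 2) + k * n * lam)

theta = np.array([1.0, 0.5])
lam = 0.1

p_plain = penalty_plain(theta, lam)
p_half_doubled = penalty_half(theta, 2 * lam)   # same penalty once lam is doubled
p_half_same = penalty_half(theta, lam)          # half the penalty if lam is unchanged

x = np.array([1.0, 2.0])
y = np.array([2.0, 3.0])
theta_plain = argmin_1d(x, y, lam, half=False)
theta_half_doubled = argmin_1d(x, y, 2 * lam, half=True)

print(p_plain, p_half_doubled, p_half_same)
print(theta_plain, theta_half_doubled)
```

With $\lambda$ doubled, the $\frac{1}{2}$ form reproduces both the penalty value and the fitted coefficient of the plain form.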

# CLOSED FORM SOLUTION

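Ridge regression also has a closed-form solution, the regularized counterpart of the normal equation for linear regression. It minimizes the unscaled objective $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{m} \theta_j^2$; compared with the $\frac{1}{2n}$-scaled cost function above, the same coefficients are obtained when the $\lambda$ used here is $2n$ times the $\lambda$ used there. The solution is given by: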

$\hat{\theta} = (X^T X + \lambda I)^{-1} X^T y$

where:
- $\hat{\theta}$ is the vector of estimated coefficients, i.e. the model parameters that minimize the cost function.
- $X$ is the matrix of feature values, with each row representing an observation and each column a feature. A column of ones is typically added to $X$ to include the intercept term.
- $y$ is the vector of target values (labels).
- $\lambda$ is the regularization parameter, a non-negative value that controls the strength of the regularization. Larger values of $\lambda$ impose a greater penalty on the size of the coefficients.
- $I$ is the identity matrix, with one row and one column per column of $X$ (the features plus the intercept). Its first diagonal element is often set to 0 to exclude the intercept from regularization.
- $X^T$ is the transpose of $X$.
- $(X^T X + \lambda I)^{-1}$ is the inverse of $X^T X + \lambda I$.
- $X^T y$ is the product of $X^T$ and $y$.
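
As a sketch of where this formula comes from, assume the objective being minimized is the unscaled sum of squared residuals plus the penalty, written in matrix form as $L(\theta)$ below (the name $L$ and this scaling are assumptions made for illustration). Setting its gradient to zero and solving for $\theta$ gives the closed form:

$L(\theta) = (y - X\theta)^T (y - X\theta) + \lambda\, \theta^T I\, \theta$ \
$\nabla_\theta L(\theta) = -2 X^T (y - X\theta) + 2 \lambda I \theta = 0$ \
$X^T X \theta + \lambda I \theta = X^T y$ \
$\hat{\theta} = (X^T X + \lambda I)^{-1} X^T y$

The last step requires $X^T X + \lambda I$ to be invertible.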
 


Let's walk through the Ridge regression closed-form formula step by step on a small dataset:

### Given Dataset:

| $x_1$   | $x_2$   | $y$   |
|---------|---------|-------|
| 1       | 2       | 3     |
| 4       | 5       | 6     |

### Step 1: Augment the Feature Matrix $X$

First, we need to create the feature matrix $X$ and include a column of ones for the intercept term.

$X = \begin{bmatrix} 1 & 1 & 2 \\ 1 & 4 & 5 \end{bmatrix}$

### Step 2: Define the Target Vector $y$

The target vector $y$ consists of the output values.

$y = \begin{bmatrix} 3 \\ 6 \end{bmatrix}$

### Step 3: Choose a Regularization Parameter $\lambda$

For Ridge regression, we need to select a regularization parameter $\lambda$ that controls the amount of shrinkage applied to the coefficients. Let's assume $\lambda = 1$ for this example.

### Step 4: Construct the Identity Matrix $I$

Create an identity matrix $I$ that matches the number of features including the intercept. Since the intercept is not regularized, its corresponding value in $I$ is set to 0.

$I = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$

### Step 5: Calculate $X^T X + \lambda I$

Compute the matrix multiplication of $X^T$ (the transpose of $X$) and $X$, and then add $\lambda$ times the identity matrix $I$.

$X^T X + \lambda I = \begin{bmatrix} 2 & 5 & 7 \\ 5 & 17 & 22 \\ 7 & 22 & 29 \end{bmatrix} + \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 2 & 5 & 7 \\ 5 & 18 & 22 \\ 7 & 22 & 30 \end{bmatrix}$
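
Steps 1 to 5 can be reproduced numerically. A minimal NumPy sketch; the variable names (`penalty_matrix`, `gram`, `A`) are illustrative choices, not notation from the text.

```python
import numpy as np

# Step 1: feature matrix with a leading column of ones for the intercept
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 4.0, 5.0]])

# Step 3: regularization parameter
lam = 1.0

# Step 4: identity matrix with the intercept entry zeroed out
penalty_matrix = np.eye(X.shape[1])
penalty_matrix[0, 0] = 0.0

# Step 5: X^T X + lambda * I
gram = X.T @ X
A = gram + lam * penalty_matrix

print(gram)
print(A)
```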

### Step 6: Compute the Inverse

Find the inverse of the matrix $X^T X + \lambda I$.

$(X^T X + \lambda I)^{-1} = \frac{1}{20} \begin{bmatrix} 56 & 4 & -16 \\ 4 & 11 & -9 \\ -16 & -9 & 11 \end{bmatrix}$
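
The inverse in this step can be checked numerically. In the sketch below, `A_inv_by_hand` is a cofactor-based inverse worked out by hand for this example (determinant 20); it is included only as a cross-check against `np.linalg.inv`.

```python
import numpy as np

# X^T X + lambda * I from Step 5
A = np.array([[2.0, 5.0, 7.0],
              [5.0, 18.0, 22.0],
              [7.0, 22.0, 30.0]])

A_inv = np.linalg.inv(A)

# Hand-derived inverse: matrix of cofactors divided by det(A) = 20
A_inv_by_hand = np.array([[56.0, 4.0, -16.0],
                          [4.0, 11.0, -9.0],
                          [-16.0, -9.0, 11.0]]) / 20.0

print(A_inv)
print(np.allclose(A_inv, A_inv_by_hand))    # numerical and hand-derived inverses agree
print(np.allclose(A @ A_inv, np.eye(3)))    # A times its inverse is the identity
```

In practice, forming the inverse explicitly is usually avoided; solving the linear system with `np.linalg.solve` gives the same coefficients with better numerical behavior.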

### Step 7: Calculate $X^T y$

Multiply the transpose of $X$ by the target vector $y$.

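$X^T y = \begin{bmatrix} 1 \times 3 + 1 \times 6 \\ 1 \times 3 + 4 \times 6 \\ 2 \times 3 + 5 \times 6 \end{bmatrix} = \begin{bmatrix} 9 \\ 27 \\ 36 \end{bmatrix}$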

### Step 8: Solve for $\hat{\theta}$

Multiply the inverse from Step 6 by the result from Step 7 to get the estimated coefficients $\hat{\theta}$.

$\hat{\theta} = (X^T X + \lambda I)^{-1} X^T y$

Carrying out these computations on the given dataset gives the estimated coefficients:

$\hat{\theta} = \begin{bmatrix} 1.8 \\ 0.45 \\ 0.45 \end{bmatrix}$

This vector $\hat{\theta}$ contains the estimated intercept ($\theta_0 = 1.8$) and the coefficients for the features $x_1$ ($\theta_1 = 0.45$) and $x_2$ ($\theta_2 = 0.45$), obtained with $\lambda = 1$. In this example $X^T X$ on its own is singular, because there are only two observations for three parameters, so ordinary least squares has no unique solution. Adding $\lambda I$ makes the matrix invertible and selects a solution with small coefficients, which reduces the risk of overfitting and can improve the model's generalization performance.
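
The whole walkthrough can be condensed into a few lines of NumPy. This is a sketch under the same conventions as the steps above (a leading column of ones in $X$, intercept excluded from the penalty). The function name `ridge_closed_form` is hypothetical, and it uses `np.linalg.solve` rather than an explicit inverse.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) theta = X^T y.

    Assumes the first column of X is the column of ones, so the first
    diagonal entry of I is set to 0 to leave the intercept unpenalized.
    """
    penalty_matrix = np.eye(X.shape[1])
    penalty_matrix[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * penalty_matrix, X.T @ y)

X = np.array([[1.0, 1.0, 2.0],
              [1.0, 4.0, 5.0]])
y = np.array([3.0, 6.0])

theta_hat = ridge_closed_form(X, y, lam=1.0)
print(np.round(theta_hat, 4))       # intercept, then coefficients for x1 and x2

# X has rank 2 but 3 columns, so X^T X alone is not invertible.
rank_X = np.linalg.matrix_rank(X)
print(rank_X, X.shape[1])
```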
