# Ridge Regression Cost Function

Ridge Regression, also known as Tikhonov regularization, is a type of linear regression that includes a regularization term. The purpose of this regularization term is to penalize large coefficients in the model, which can help prevent overfitting and improve the model's generalization to new data. The cost function for Ridge Regression combines the Residual Sum of Squares (RSS) with a penalty on the size of the coefficients.

The Ridge Regression cost function is given by:

$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{m} \theta_j^2$

where:
- $J(\theta)$ is the cost function to be minimized.
- $n$ is the number of observations.
- $y_i$ is the actual value for the $i$-th observation.
- $\hat{y}_i$ is the predicted value for the $i$-th observation, calculated as $\hat{y}_i = \theta^T x_i$, where $\theta$ is the coefficient vector and $x_i$ is the feature vector for the $i$-th observation. When the model has an intercept $\theta_0$, $x_i$ carries a leading 1 so that $\theta^T x_i = \theta_0 + \theta_1 x_{i1} + \dots + \theta_m x_{im}$.
- $\lambda \ge 0$ is the regularization parameter, a hyperparameter that controls the amount of shrinkage: the larger $\lambda$ is, the more the coefficients are shrunk toward zero, which also makes them more robust to collinearity among the features. With $\lambda = 0$ the cost reduces to that of ordinary least squares.
- $\theta_j$ are the model coefficients, and $m$ is the number of features (excluding the intercept). The penalty sum starts at $j = 1$, so the intercept $\theta_0$ is not penalized.
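- The first term is the RSS divided by $2n$. Dividing by $n$ normalizes the RSS by the number of observations, and the extra factor of $\frac{1}{2}$ is a convention that cancels the 2 produced when the squared error is differentiated.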
- The second term is the regularization term, where $\lambda \sum_{j=1}^{m} \theta_j^2$ adds a penalty for large coefficients.
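The formula above can be sketched as a small function. This is a minimal illustration in plain Python, not a reference implementation: the name `ridge_cost`, its argument layout, and the toy inputs in the usage lines are assumptions made here for illustration. The intercept is passed separately so that it stays out of the penalty, matching the sum over $j = 1, \dots, m$.

```python
def ridge_cost(X, y, theta0, theta, lam):
    """Ridge cost: RSS / (2n) + lam * sum(theta_j ** 2).

    X      : list of feature rows, one row (list of m floats) per observation
    y      : list of actual values
    theta0 : intercept (not penalized)
    theta  : list of m coefficients (penalized)
    lam    : regularization parameter, lambda >= 0
    """
    n = len(y)
    rss = 0.0
    for x_i, y_i in zip(X, y):
        # Prediction: intercept plus the dot product of coefficients and features
        y_hat = theta0 + sum(t * x for t, x in zip(theta, x_i))
        rss += (y_i - y_hat) ** 2
    penalty = lam * sum(t ** 2 for t in theta)
    return rss / (2 * n) + penalty


# Hypothetical toy inputs: predictions are 0 and 1, so RSS = 1.
# Cost = 1 / (2 * 2) + 0.5 * 1**2 = 0.25 + 0.5
print(ridge_cost([[0.0], [1.0]], [0.0, 2.0], 0.0, [1.0], 0.5))  # 0.75
```

Setting `lam` to 0 in this sketch leaves only the scaled RSS, which is the ordinary least squares cost.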

### Practical Example with Ridge Regression Cost Function

Let's consider a simple example with a dataset containing only 2 observations and 1 feature to illustrate the computation of the Ridge Regression cost function.

Suppose we have the following dataset:

| Observation | Feature ($x_1$) | Actual Value ($y$) |
|-------------|------------------|---------------------|
| 1           | 1                | 2                   |
| 2           | 2                | 3                   |

And we have a simple Ridge Regression model with the following parameters:

- $\theta_0 = 0.5$ (intercept)
- $\theta_1 = 1$ (coefficient for $x_1$)
- $\lambda = 0.1$

Our model predicts $\hat{y}$ as:

$\hat{y} = \theta_0 + \theta_1 x_1$

Let's compute the cost function $J(\theta)$ for this model and dataset.

### Step 1: Compute Predicted Values
Calculate the predicted values $\hat{y}$ for each observation.

For Observation 1: $\hat{y}_1 = 0.5 + 1 \times 1 = 1.5$  
For Observation 2: $\hat{y}_2 = 0.5 + 1 \times 2 = 2.5$

### Step 2: Compute RSS
Calculate the Residual Sum of Squares (RSS).

RSS = $(2 - 1.5)^2 + (3 - 2.5)^2 = 0.25 + 0.25 = 0.5$
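Steps 1 and 2 can be checked with a few lines of plain Python. This is only a sketch; the variable names are chosen here for illustration, while the data and parameters are the ones from the example.

```python
# Dataset and parameters from the worked example
x = [1.0, 2.0]            # feature values x_1
y = [2.0, 3.0]            # actual values
theta0, theta1 = 0.5, 1.0

# Step 1: predicted values y_hat = theta0 + theta1 * x
y_hat = [theta0 + theta1 * xi for xi in x]

# Step 2: residual sum of squares
residuals = [yi - yh for yi, yh in zip(y, y_hat)]
rss = sum(r ** 2 for r in residuals)

print(y_hat)  # [1.5, 2.5]
print(rss)    # 0.5
```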

### Step 3: Compute Regularization Term
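Calculate the regularization term. The sum starts at $j = 1$, so the intercept $\theta_0$ is not penalized and only $\theta_1$ contributes.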

Regularization Term = $\lambda \sum_{j=1}^{m} \theta_j^2 = 0.1 \times 1^2 = 0.1$

### Step 4: Compute Cost Function
Scale the RSS by $\frac{1}{2n}$ (with $n = 2$) and add the regularization term to get the cost function $J(\theta)$.

$J(\theta) = \frac{1}{2 \times 2} \times 0.5 + 0.1 = 0.125 + 0.1 = 0.225$
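The same arithmetic, written out as a short Python check. The variable names are illustrative, and the inputs are the values obtained in Steps 1 to 3.

```python
# Values taken from Steps 1-3 of the worked example
n = 2          # number of observations
rss = 0.5      # residual sum of squares from Step 2
theta1 = 1.0   # the only penalized coefficient
lam = 0.1      # regularization parameter

data_term = rss / (2 * n)     # 0.5 / 4 = 0.125
penalty = lam * theta1 ** 2   # 0.1 * 1 = 0.1
cost = data_term + penalty

print(round(cost, 3))  # 0.225
```

Keeping the two terms separate makes it easy to see how much of the cost comes from the fit and how much from the penalty.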

### Interpretation
The value 0.225 is the cost of this model on this dataset for the chosen $\lambda$: 0.125 comes from the prediction error (the RSS scaled by $\frac{1}{2n}$) and 0.1 comes from the penalty on the size of the coefficient. Minimizing this cost function during training finds coefficients that fit the training data well while staying relatively small, which helps avoid overfitting.
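To make the minimization concrete, here is a hedged sketch for the one-feature case. Setting the partial derivatives of $J(\theta)$ to zero, with the intercept left unpenalized, gives $\theta_1 = \frac{S_{xy}}{S_{xx} + 2n\lambda}$ and $\theta_0 = \bar{y} - \theta_1 \bar{x}$, where $S_{xy} = \sum_i (x_i - \bar{x})(y_i - \bar{y})$ and $S_{xx} = \sum_i (x_i - \bar{x})^2$. The function names `fit_ridge_1d` and `cost_1d` are hypothetical helpers introduced here, and the factor $2n$ follows from this document's $\frac{1}{2n}$ scaling, so libraries that scale the cost differently will pair a given $\lambda$ with different coefficients.

```python
def fit_ridge_1d(x, y, lam):
    """Minimizer of RSS/(2n) + lam * theta1**2 for one feature plus intercept."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    # Assumes s_xx + 2 * n * lam > 0 (x is not constant, or lam > 0)
    theta1 = s_xy / (s_xx + 2 * n * lam)
    theta0 = y_mean - theta1 * x_mean
    return theta0, theta1


def cost_1d(x, y, theta0, theta1, lam):
    """Ridge cost for one feature, with the intercept unpenalized."""
    n = len(x)
    rss = sum((yi - (theta0 + theta1 * xi)) ** 2 for xi, yi in zip(x, y))
    return rss / (2 * n) + lam * theta1 ** 2


x = [1.0, 2.0]
y = [2.0, 3.0]

# Larger lambda shrinks the slope theta1 toward zero
for lam in (0.0, 0.1, 1.0):
    t0, t1 = fit_ridge_1d(x, y, lam)
    print(lam, round(t0, 4), round(t1, 4), round(cost_1d(x, y, t0, t1, lam), 4))
```

For $\lambda = 0.1$ the minimizing coefficients are roughly $\theta_0 \approx 1.667$ and $\theta_1 \approx 0.556$, with a cost of about 0.056, well below the 0.225 obtained for the hand-picked parameters in the example. As $\lambda$ grows from 0 to 1, the slope shrinks from 1 to about 0.111.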