# Ridge Regression

## Short Introduction 
- Ridge regression is a variation of linear regression, specifically designed to address multicollinearity in the dataset.
- In linear regression, the goal is to find the best-fitting hyperplane that minimizes the sum of squared differences between the observed and predicted values. However, when there are highly correlated variables, linear regression may become unstable and provide unreliable estimates. 

- Multicollinearity exists when two or more of the predictors in a regression model are moderately or highly correlated with one another. 

- Ridge regression introduces a regularization term that penalizes large coefficients, helping to stabilize the model and prevent overfitting.
- The regularization term, also known as the L2 penalty, adds a constraint to the optimization process, pushing the model toward smaller coefficients for the predictors. 
- By striking a balance between fitting the data well and keeping the coefficients in check, ridge regression improves the robustness and performance of linear regression models, especially in situations with multicollinearity.

### Linear Regression --for a good start
- Let's briefly recall what linear regression was about. 
- In linear regression, the model training essentially involves finding the appropriate values for coefficients. 
- This is done using the method of least squares. One seeks the values B0,B1,... that minimize the Residual Sum of Squares: 
- ![alt text](image.png)

### Ridge Regression --definition 
- Ridge regression is very similar to the method of least squares, with the exception that the coefficients are estimated by minimizing a slightly different quantity. 
- In reality, it's the same quantity with one extra term, which we call the shrinkage penalty. 
- ![alt text](image-1.png)
- Before we explain what ridge regression is, let's find out what the mysterious shrinkage penalty is all about. 
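
For reference, the two quantities are usually written as below. This is the standard textbook form and is only assumed to match the figures above; it uses the B0, B1, ... notation from the text, with n observations, p predictors, and x_ij as the value of predictor j for observation i.

```latex
\mathrm{RSS} = \sum_{i=1}^{n} \Bigl( y_i - B_0 - \sum_{j=1}^{p} B_j x_{ij} \Bigr)^2

\text{ridge objective} = \mathrm{RSS} + \lambda \sum_{j=1}^{p} B_j^2, \qquad \lambda \ge 0
```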

### Shrinkage penalty --aid in learning 
- The shrinkage penalty in ridge regression 
- ![alt text](image-2.png)
- refers to the regularization term added to the linear regression equation to prevent overfitting and address multicollinearity. 
- In ridge regression, the starting objective is still to minimize the sum of squared differences between observed and predicted values. 
- However, to this, a penalty term is added, which is proportional to the square of the magnitude of the coefficients. 
- This penalty term is the squared L2 norm (Euclidean norm) of the coefficient vector, which is why it is also called the L2 penalty. 

- 𝜆≥0 is called the tuning parameter of the method, which is chosen separately. 
- The parameter 𝜆 controls how strongly the coefficients are shrunk toward 0. 
- When 𝜆=0, the penalty has no effect, and ridge regression reduces to the ordinary least squares method. 
- However, as 𝜆→∞ the impact of the penalty grows, and the estimates of the coefficients Bj in ridge regression shrink towards zero. 
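
The effect of 𝜆 can be seen in a small numerical sketch. It assumes the standard closed-form ridge solution on centered data (not stated in these notes) and uses made-up synthetic data; `ridge_fit` is a hypothetical helper name, not a library function.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge fit; returns (intercept, coefficients).

    X and y are centered first, so the intercept is not penalized.
    """
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    p = X.shape[1]
    # Solve (Xc^T Xc + lam * I) coef = Xc^T yc
    coef = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return intercept, coef

# Synthetic data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

lambdas = [0.0, 1.0, 100.0, 1e6]
norms = []
for lam in lambdas:
    _, coef = ridge_fit(X, y, lam)
    norms.append(np.linalg.norm(coef))
    print(f"lambda={lam:g}  coef={np.round(coef, 4)}  norm={norms[-1]:.4f}")
```

With 𝜆=0 this reproduces the least squares fit, and the coefficient norm gets smaller with each larger 𝜆.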

### How to choose 𝜆?
- At the beginning, it's not known. 
- The only way is to test many values, and that's typically how it's done. 
- However, many implementations assist in selecting an appropriate 𝜆, most commonly through cross-validation.
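
Testing many values with cross-validation can be sketched as follows. This is a minimal NumPy illustration under assumptions: the data is synthetic, the candidate grid is arbitrary, and `cv_mse` is a hypothetical helper that uses the closed-form ridge solution on centered data.

```python
import numpy as np

def cv_mse(X, y, lam, k=5, seed=0):
    """Mean squared validation error of ridge with penalty lam, averaged over k folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    errors = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        # Center using training data only; the intercept is not penalized.
        x_mean, y_mean = X[train].mean(axis=0), y[train].mean()
        Xc, yc = X[train] - x_mean, y[train] - y_mean
        coef = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
        pred = y_mean + (X[fold] - x_mean) @ coef
        errors.append(np.mean((y[fold] - pred) ** 2))
    return float(np.mean(errors))

# Synthetic data (illustrative only): one informative predictor out of ten
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 10))
y = 3.0 * X[:, 0] + rng.normal(size=60)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]   # arbitrary candidate grid
scores = {lam: cv_mse(X, y, lam) for lam in lambdas}
best_lam = min(scores, key=scores.get)
print(scores)
print("selected lambda:", best_lam)
```

The same seed is reused for every candidate so that all values of 𝜆 are compared on identical folds. Libraries such as scikit-learn package this search (for example `RidgeCV`), where the parameter is called `alpha` rather than 𝜆.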

### Why should you scale predictors? 
- It should also be noted that the shrinkage penalty is applied exclusively to the coefficients B1,...,Bp, but it does not affect the intercept term B0. 
- We do not shrink the intercept -- it represents the prediction of the mean value of the dependent variable when all predictors are equal to 0. 
- Assuming that the variables have been centered to have a mean of zero before conducting ridge regression, the estimated intercept takes the form 
- ![alt text](image-3.png)

- It should be emphasized that scaling predictors matters. 
- In linear regression, multiplying the predictor Xj by a constant c scales the estimated coefficient by a factor of 1/c (meaning XjBj remains unchanged). 
- However, in ridge regression, because of the shrinkage penalty, rescaling the predictor Xj can significantly change the estimated coefficient Bj, the other coefficients, and therefore the predictions. 
- Therefore, before applying ridge regression, predictors are standardized to be on the same scale.
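
The difference can be checked numerically. The sketch below is only an illustration under assumptions: synthetic data, an arbitrary penalty of 10, and a hypothetical helper `fitted_values` based on the closed-form ridge solution (with 𝜆=0 it is ordinary least squares).

```python
import numpy as np

def fitted_values(X, y, lam):
    """Return (in-sample predictions, coefficients) for ridge with penalty lam."""
    Xc = X - X.mean(axis=0)
    coef = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]),
                           Xc.T @ (y - y.mean()))
    return y.mean() + Xc @ coef, coef

# Synthetic data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.1, size=40)

c = 100.0
X_scaled = X.copy()
X_scaled[:, 0] *= c            # same predictor expressed in a different unit

ols_pred, ols_coef = fitted_values(X, y, 0.0)
ols_pred_s, ols_coef_s = fitted_values(X_scaled, y, 0.0)
ridge_pred, ridge_coef = fitted_values(X, y, 10.0)
ridge_pred_s, ridge_coef_s = fitted_values(X_scaled, y, 10.0)

print("OLS predictions unchanged:  ", np.allclose(ols_pred, ols_pred_s))
print("OLS coefficient scaled 1/c: ", np.isclose(ols_coef_s[0] * c, ols_coef[0]))
print("Ridge predictions unchanged:", np.allclose(ridge_pred, ridge_pred_s))
```

Least squares compensates exactly for the change of unit, while the ridge fit changes because the penalty acts on the raw size of each coefficient.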

- Feature standardization is a preprocessing step in machine learning where the input features are transformed to have a mean of 0 and a standard deviation of 1. 
- This is typically achieved by subtracting the mean of each feature from its values and then dividing by the standard deviation. 
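
As a minimal sketch of that step in NumPy (the helper name `standardize` and the synthetic data are assumptions for illustration):

```python
import numpy as np

def standardize(X):
    """Return standardized X plus the column means and standard deviations."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)          # population standard deviation (ddof=0)
    return (X - mean) / std, mean, std

# Two features on very different scales (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(loc=[10.0, -3.0], scale=[5.0, 0.1], size=(100, 2))

X_std, mean, std = standardize(X)
print(np.round(X_std.mean(axis=0), 6))   # approximately 0 for each column
print(np.round(X_std.std(axis=0), 6))    # approximately 1 for each column
```

When the model is later applied to new observations, those should be transformed with the `mean` and `std` computed on the training data, not with their own statistics.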

### Bias-variance tradeoff of the ridge estimator
- The advantage of ridge regression over the method of least squares arises from the inherent trade-off between variance and bias. 
- Ridge regression introduces a regularization parameter, denoted as 𝜆, which controls the extent of shrinkage applied to the regression coefficients. 
- As the value of 𝜆 increases, the model's flexibility in fitting the data diminishes. 
- Consequently, this decrease in flexibility results in a reduction in variance but an increase in bias. 

- Note that:
- When the number of predictors, p, is close to the number of observations, n, the method of least squares exhibits high variance: a small change in the training data can lead to a significant change in the estimated parameters. 
- When p>n, the method of least squares stops working (due to the lack of estimation uniqueness), whereas ridge regression handles this situation well. 
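
The p>n case can be illustrated with a short sketch. It assumes the usual matrix view (not spelled out in these notes): least squares needs to invert XᵀX, which is singular when p>n, while ridge solves with XᵀX + 𝜆I, which is invertible for any 𝜆>0. The data is synthetic and the intercept is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 30                       # more predictors than observations
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

gram = X.T @ X                      # p x p matrix used by least squares
rank = np.linalg.matrix_rank(gram)
print("rank of X^T X:", rank, "out of", p)   # rank is at most n, so the matrix is singular

lam = 1.0                           # arbitrary positive penalty
ridge_coef = np.linalg.solve(gram + lam * np.eye(p), X.T @ y)
print("ridge solution has shape", ridge_coef.shape)
```

Because the rank is below p, infinitely many coefficient vectors give the same least squares fit, whereas the ridge system has exactly one solution.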

#### How Ridge Regression Balances Bias and Variance 
- 1. High Variance (Overfitting): 
    - When a model has too many parameters or the regularization is too weak (λ is small), it can fit the noise in the training data. 
    - Ridge regression reduces variance by shrinking the model coefficients toward zero, making the model simpler and less sensitive to noise. 
- 2. High Bias (Underfitting):
    - When λ is too large, the penalty term dominates, and the model coefficients are overly shrunk. 
    - This results in high bias as the model fails to capture the true underlying patterns in the data. 
- 3. Tradeoff Mechanism: 
    - λ controls the tradeoff between bias and variance:
        - Small λ: Lower bias, higher variance. 
        - Large λ: Higher bias, lower variance. 
    - The optimal λ balances the two to minimize the total expected error (the sum of squared bias, variance, and irreducible error).
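
The tradeoff can be observed in a small simulation. This is a sketch under assumptions: a fixed synthetic design matrix, known true coefficients, no intercept, and fresh noise in each repetition. It estimates the squared bias and the variance of the ridge coefficient estimates for a few arbitrary values of λ.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, n_sims = 30, 5, 500
true_coef = np.array([2.0, -1.0, 0.5, 0.0, 1.5])
X = rng.normal(size=(n, p))          # fixed design, reused in every repetition

def simulate(lam):
    """Return (squared bias, variance) of the ridge estimates, summed over coefficients."""
    estimates = np.empty((n_sims, p))
    for s in range(n_sims):
        y = X @ true_coef + rng.normal(size=n)      # new noise each time
        estimates[s] = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    bias_sq = float(np.sum((estimates.mean(axis=0) - true_coef) ** 2))
    variance = float(np.sum(estimates.var(axis=0)))
    return bias_sq, variance

results = {lam: simulate(lam) for lam in [0.0, 10.0, 100.0]}
for lam, (bias_sq, variance) in results.items():
    print(f"lambda={lam:g}  bias^2={bias_sq:.4f}  variance={variance:.4f}")
```

In this setup the variance falls and the squared bias rises as λ grows, matching the small-λ and large-λ cases listed above.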

#### Practical Application
- To find the optimal λ, cross-validation is typically used. It repeatedly splits the dataset into training and validation sets to evaluate the model's performance for different values of λ, ensuring a balanced tradeoff between bias and variance.

https://datasciencedecoded.com/posts/8_Ridge_Regression_for_Improved_Predictive_Models