Q1. What is Lasso Regression, and how does it differ from other regression techniques?

Ans: Lasso regression is a regularization technique, also called a shrinkage or penalized regression method, that is used for both regularization and feature selection. Lasso is short for Least Absolute Shrinkage and Selection Operator. It adds an L1 penalty (the sum of the absolute values of the coefficients) to the least-squares loss.

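It differs from ridge regression in how the coefficients shrink. In the idealized case of orthonormal predictors, the ridge coefficients are the simple linear regression coefficients scaled down by a constant factor, so they become very small but never exactly zero. The lasso coefficients are instead reduced by a constant amount and are set to exactly zero once they fall within a certain range around zero, which is why lasso produces sparse models while ridge keeps every feature.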
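The contrast can be sketched numerically. This is a minimal illustration, not a general solver: it assumes an orthonormal design (`X.T @ X = I`), the objectives ½·RSS + (λ/2)·‖β‖² for ridge and ½·RSS + λ·‖β‖₁ for lasso, and made-up values for `ols` and `lam`.

```python
import numpy as np

# Hypothetical OLS coefficients for an orthonormal design (X.T @ X = I).
ols = np.array([3.0, -1.5, 0.4, -0.2])
lam = 0.5  # assumed penalty strength, chosen only for illustration

# Ridge: every coefficient is scaled by the same factor 1 / (1 + lam),
# so none of them becomes exactly zero.
ridge = ols / (1.0 + lam)

# Lasso: every coefficient moves toward zero by lam and is clipped at zero
# (soft-thresholding), so small coefficients vanish entirely.
lasso = np.sign(ols) * np.maximum(np.abs(ols) - lam, 0.0)

print("ridge:", ridge)
print("lasso:", lasso)
```

With these numbers the two smallest coefficients are removed by lasso, while ridge only makes them smaller.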

Q2. What is the main advantage of using Lasso Regression in feature selection?

Ans: The main advantage of a LASSO regression model is that it can set the coefficients of features it does not find useful to exactly zero. This means the model performs automatic feature selection: it decides on its own which features are included and which are left out.
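A small sketch of this behaviour with scikit-learn, where the penalty strength λ is called `alpha`. The synthetic data, the seed, and `alpha=0.3` are assumptions made for the example: only the first two of ten features actually drive the target.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (assumed for illustration): 10 features, only 2 are relevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Fit lasso; alpha plays the role of lambda.
model = Lasso(alpha=0.3).fit(X, y)

# Features whose coefficients were not set to zero.
selected = np.flatnonzero(model.coef_).tolist()
print("coefficients:", np.round(model.coef_, 2))
print("selected feature indices:", selected)
```

The kept coefficients are shrunk relative to the true values 3 and -2, which is the price paid for the selection effect.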

Q3. How do you interpret the coefficients of a Lasso Regression model?

Ans: The coefficient of a term represents the change in the mean response for a one-unit change in that term, holding the other terms fixed. If the coefficient is negative, the mean value of the response decreases as the term increases. If the coefficient is positive, the mean value of the response increases as the term increases.

In a lasso model, some of the coefficients will be set to exactly zero, which means that the corresponding features have been excluded from the model. The non-zero coefficients mark the features the model kept for predicting the target variable. Because of the shrinkage, their magnitudes are biased toward zero, and they are comparable across features only if the features are on the same scale.

Q4. What are the tuning parameters that can be adjusted in Lasso Regression, and how do they affect the model's performance?

Ans: A tuning parameter (λ), sometimes called a penalty parameter, controls the strength of the penalty term in ridge regression and lasso regression. It sets the amount of shrinkage. Shrinkage in general means pulling estimates toward a central point, such as the mean; in lasso, the coefficients are pulled toward zero. Shrinkage results in simple, sparse models that are easier to analyze than high-dimensional models with large numbers of parameters.

* When λ = 0, no parameters are eliminated. The estimate is equal to the one found with linear regression.
* As λ increases, more and more coefficients are set to zero and eliminated.
* When λ = ∞, all coefficients are eliminated.

There is a trade-off between bias and variance in the resulting estimators. As λ increases, bias increases, and as λ decreases, variance increases. For example, setting the tuning parameter to a high value results in a more manageable number of model parameters and lower variance, but at the expense of higher bias. Setting it to a low value keeps more parameters and lowers bias, but at the expense of a larger variance.
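The effect of λ on sparsity can be checked with a short sweep. This is a sketch on assumed synthetic data; scikit-learn names the parameter `alpha`, and the grid of values is arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (assumed for illustration): 10 features, only 2 are relevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Fit one model per penalty strength and count the surviving coefficients.
alphas = [0.001, 0.1, 1.0, 5.0]
nonzero_counts = []
for alpha in alphas:
    coef = Lasso(alpha=alpha, max_iter=10_000).fit(X, y).coef_
    nonzero_counts.append(int(np.count_nonzero(coef)))
    print(f"alpha={alpha:<6} non-zero coefficients: {nonzero_counts[-1]}")
```

In practice every coefficient is already zero at some finite value of λ, which is why the largest value in this grid eliminates all of them.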



Q5. Can Lasso Regression be used for non-linear regression problems? If yes, how?

Ans: If you can linearize the model, then yes, but the result is only an approximate solution in the least-squares sense, since what is measured is y and not any of its possible transforms. If your model is nonlinear because of a single parameter, there are also things that can be done.

The essential part of LASSO is just adding an L1 norm of the coefficients to the main term,

f(x,y,β)+λ∥β∥1.

There's no reason 'f' has to be a linear model. It may not have an analytic solution, or be convex, but there's nothing stopping you from trying it out, and it should still induce sparsity, contingent on a large enough lambda.
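One common way to apply this in practice, shown here only as a sketch, is to keep the model linear in its coefficients but feed it non-linear features. The example assumes a quadratic ground truth, a degree-5 polynomial expansion, and `alpha=0.05`; all three are choices made for the illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic non-linear data (assumed): y depends on x squared.
rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=(300, 1))
y = 1.5 * x[:, 0] ** 2 + rng.normal(scale=0.3, size=300)

# Expand x into x, x^2, ..., x^5, put the terms on a common scale,
# then let the L1 penalty decide which terms to keep.
model = make_pipeline(
    PolynomialFeatures(degree=5, include_bias=False),
    StandardScaler(),
    Lasso(alpha=0.05, max_iter=50_000),
)
model.fit(x, y)

r2 = model.score(x, y)  # R^2 on the training data
coefs = model.named_steps["lasso"].coef_
print("training R^2:", round(r2, 3))
print("coefficients for x^1..x^5:", np.round(coefs, 3))
```

Scaling before the penalty matters here because the polynomial terms have very different ranges, and the L1 penalty treats all coefficients alike.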

Q6. What is the difference between Ridge Regression and Lasso Regression?

Ans:
* Ridge Regression:
In ridge regression, we add a penalty term equal to the sum of the squared coefficients. This is the L2 term, the squared magnitude of the coefficient vector. A coefficient λ controls the strength of that penalty. If λ is zero, the objective is plain OLS; if λ > 0, the penalty constrains the coefficients. As we increase λ, this constraint pushes the coefficients toward zero. The result is a trade-off: higher bias (the model is less flexible because the coefficients are kept small) in exchange for lower variance.

![image.png](attachment:bfc1c0e5-8aa1-4295-98dc-1d764524e5d9.png)

* Lasso Regression:
Lasso stands for Least Absolute Shrinkage and Selection Operator. It also adds a penalty term to the cost function, in this case the sum of the absolute values of the coefficients. As coefficients move away from 0, this term penalizes them, which causes the model to decrease their values in order to reduce the loss. The difference from ridge regression is that lasso tends to set some coefficients to exactly zero, whereas ridge never sets a coefficient to exactly zero.

![image.png](attachment:13b99462-73ce-4f09-aa02-cde0cc10db8f.png)
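For reference, the two objectives can be written side by side. This is the standard textbook notation, with the intercept omitted for brevity, and it may differ in detail from the images above:

$$
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \; \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2
$$

$$
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \; \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|
$$

The only difference is the last term: squared coefficients for ridge, absolute values for lasso.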

Q7. Can Lasso Regression handle multicollinearity in the input features? If yes, how?

Ans: Yes. Least Absolute Shrinkage and Selection Operator (LASSO) regression is another method that tolerates multicollinearity. It solves the same constrained optimization problem as ridge regression, but uses the L1 norm rather than the L2 norm as the measure of complexity.

![image.png](attachment:c3f84fa4-38a0-42ae-a13e-73148f5c1a0c.png)

LASSO regression can be visualized similarly to ridge regression, but since c is defined by the sum of the absolute values of beta rather than the sum of squares, the area it constrains is diamond shaped rather than circular. The figure below shows the selection of the beta estimator from LASSO regression. The solution often lands on a corner of the diamond, which means LASSO regression selects one of the correlated predictors and drops the other (weights it as zero). This has been argued to provide more interpretable estimators.

![image.png](attachment:775d0773-9b97-4aaf-b8aa-fadaefbb1af4.png)
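A small sketch of this behaviour on two correlated predictors, compared with ridge. The setup is an assumption chosen for the illustration: `x2` has correlation of about 0.9 with `x1`, and the target depends on `x1` only. With other data, lasso may keep a different predictor or keep both.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Two strongly correlated predictors (assumed setup for illustration).
rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + np.sqrt(1 - 0.9**2) * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.1, size=n)

# Lasso tends to keep one of the correlated predictors and zero out the other.
lasso_coef = Lasso(alpha=0.5).fit(X, y).coef_

# Ridge shrinks both and spreads the weight across them instead.
ridge_coef = Ridge(alpha=100.0).fit(X, y).coef_

print("lasso:", np.round(lasso_coef, 3))
print("ridge:", np.round(ridge_coef, 3))
```

Note that the two `alpha` values are not on the same scale in scikit-learn, so they should not be compared directly between the two models.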

Q8. How do you choose the optimal value of the regularization parameter (lambda) in Lasso Regression?

Ans: The answer is Cross-Validation.

Cross-validation is a way to tune the hyperparameters using only the training data. There are different variations of cross-validation, but the most common one is 10-Fold Cross-Validation.

Remember, data is a limited resource and we have to use it wisely. The main thing to remember here is that we have to keep the test data away from the algorithm and do all the validation only on the training data.

Step 1: First split the entire dataset into training and testing sets. (70%-30% or 80%-20%).

Step 2: Now keeping the test set away, split the training data into 10 equal folds.
![image.png](attachment:18697280-c806-46a7-98bc-12013018fe93.png)

Step 3: Now say we choose lambda = 0.2. Train the model on 9 of the folds and evaluate it on the remaining holdout fold (which acts as testing data within the training data) to get the holdout score, which is the performance score of that model. Say we get 0.52.
![image.png](attachment:2cb80462-e404-4abe-9ab8-c628c3cf66d4.png)

Step 4: Repeat step 3 nine more times, each time with a different holdout fold, and record the holdout scores.
![image.png](attachment:6cfafcf3-ea5c-43fe-b1e6-5be6f48c85d0.png)

Step 5: After all the iterations are done, the model has been trained 10 times, each time on a different set of 9 folds and evaluated on the remaining holdout fold, giving 10 different holdout scores. To get the final cross-validation score, take the average of all the individual holdout scores.
![image.png](attachment:595358af-2147-4715-927c-2773c72a469a.png)

The cross-validation score is the performance of a model with a specific set of hyperparameter values (in this case lambda = 0.2) on that set of data.

Now repeat steps 3 to 5 for the other values of lambda that you would like to try, keeping the same train/test split and the same folds so that the scores are comparable. Finally, you would get something like this,
![image.png](attachment:ff7b2491-9dab-46ea-8fcb-3f6016b15a29.png)

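In this example, the best cross-validation score is obtained for lambda = 0.4, so this is your optimal value of lambda. As a last step, refit the model with this lambda on the whole training set and evaluate it once on the test set that was kept away.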
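The procedure above can be sketched with scikit-learn. The synthetic data, the seed, and the grid of lambda values are assumptions for the example, and the holdout score used here is R², the default for regressors in `cross_val_score`.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic data (assumed for illustration): 10 features, only 2 are relevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

# Step 1: split off a test set and keep it away from the tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 2: define 10 folds on the training data only.
folds = KFold(n_splits=10, shuffle=True, random_state=0)

# Steps 3 to 5, repeated for every candidate lambda (called alpha in scikit-learn):
# train on 9 folds, score on the holdout fold, average the 10 holdout scores.
lambdas = [0.01, 0.05, 0.1, 0.2, 0.4, 0.8]
cv_scores = {}
for lam in lambdas:
    fold_scores = cross_val_score(Lasso(alpha=lam), X_train, y_train, cv=folds)
    cv_scores[lam] = fold_scores.mean()
    print(f"lambda={lam:<5} mean holdout R^2: {cv_scores[lam]:.4f}")

# Pick the lambda with the best cross-validation score, refit, test once.
best_lambda = max(cv_scores, key=cv_scores.get)
final_model = Lasso(alpha=best_lambda).fit(X_train, y_train)
test_r2 = final_model.score(X_test, y_test)
print("best lambda:", best_lambda, "test R^2:", round(test_r2, 4))
```

scikit-learn also provides `LassoCV`, which runs a similar search over a path of `alpha` values in one call; the explicit loop is used here only to mirror the steps described above.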