## Q1. Explain the difference between linear regression and logistic regression models. Provide an example of a scenario where logistic regression would be more appropriate.


<table>
<tr> 
<th>Aspect</th>
<th>Linear Regression</th>
<th>Logistic Regression</th>
</tr>

<tr> 
<td>Target Variable</td>
<td>Continuous numeric value (e.g. price, temperature)</td>
<td>Categorical value, typically binary (0 or 1)</td>
</tr>

<tr> 
<td>Output</td>
<td>Any real number</td>
<td>A probability between 0 and 1, converted to a class with a threshold</td>
</tr>

<tr> 
<td>Model Form</td>
<td>The target is modeled as a linear combination of the features</td>
<td>The log-odds of the positive class are modeled as a linear combination of the features, which is passed through the sigmoid function</td>
</tr>

<tr> 
<td>Cost Function</td>
<td>Mean Squared Error (MSE)</td>
<td>Log loss (binary cross-entropy)</td>
</tr>

<tr> 
<td>Estimation</td>
<td>Ordinary least squares (closed-form solution) or gradient descent</td>
<td>Maximum likelihood, solved iteratively (e.g. with gradient descent)</td>
</tr>

<tr> 
<td>Coefficient Interpretation</td>
<td>Change in the target for a one-unit change in the feature</td>
<td>Change in the log-odds of the positive class for a one-unit change in the feature</td>
</tr>

<tr> 
<td>Evaluation Metrics</td>
<td>R-squared, MSE, RMSE</td>
<td>Accuracy, precision, recall, F1-score, ROC curve and AUC</td>
</tr>

<tr> 
<td>Best for</td>
<td>Regression problems, where the goal is to predict a quantity</td>
<td>Classification problems, where the goal is to predict class membership</td>
</tr>

</table>

<b><u>Example scenario</u></b>:

Predicting whether a patient has a disease (1) or not (0) from measurements such as age and blood pressure. The target is binary, so logistic regression is more appropriate: it returns a probability between 0 and 1, whereas linear regression could return values below 0 or above 1 that cannot be read as probabilities.
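As a minimal sketch of the core difference, the snippet below computes an unbounded linear score and then passes the same score through the sigmoid to get a probability, which is what logistic regression adds on top of a linear model. The weights, bias, and inputs are made-up illustrative values, not fitted to any data.

```python
import math

def linear_score(weights, bias, x):
    """Linear combination of features: this is what linear regression predicts directly."""
    return bias + sum(w * xi for w, xi in zip(weights, x))

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, chosen only for illustration.
weights = [0.8, -0.5]
bias = -1.0

for x in ([0.0, 0.0], [5.0, 1.0], [-4.0, 3.0]):
    z = linear_score(weights, bias, x)   # unbounded score
    p = sigmoid(z)                       # probability of the positive class
    label = int(p >= 0.5)                # class after applying a 0.5 cut-off
    print(f"x={x} score={z:.2f} probability={p:.3f} class={label}")
```

The score can be any real number, while the probability always stays strictly between 0 and 1.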

## Q2. What is the cost function used in logistic regression, and how is it optimized?


A cost function (also referred to as a loss function or objective function) is a mathematical function that measures the difference between the actual target values (ground truth) and the values predicted by the model, and so assesses a machine learning model’s performance. Usually, the objective of a machine learning algorithm is to minimize the output of the cost function.

For linear regression, the conventional cost function is the Mean Squared Error (MSE). For m training samples, the cost function (J) can be written as:

![image.png](attachment:image.png)

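where,

* $ y^i $ is the actual value of the target variable for the i-th training example.
* $ z^i = h_\theta (x^i) $ is the predicted value of the target variable for the i-th training example, calculated using the model with parameters θ.
* $ x^i $ is the i-th training example.
* m is the number of training examples.

MSE is a poor fit for logistic regression: once the prediction passes through the sigmoid function, the squared error is no longer convex in θ, so gradient descent can get stuck in local minima. Logistic regression therefore uses the log loss (binary cross-entropy), where $ z^i = h_\theta (x^i) = \sigma(\theta^T x^i) $ is now the predicted probability of the positive class:

$$ J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^i \log z^i + (1 - y^i) \log (1 - z^i) \right] $$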


<b><u>How it is Optimized</u></b>:

A common way to estimate the coefficients is gradient descent, where the goal is to minimize the log-loss cost function over all samples. The method starts from initial parameter values and updates them incrementally in the direction that decreases the loss. The gradient is the vector of partial derivatives of the loss; it points in the direction of the fastest increase of the function. At each iteration the parameters are therefore moved in the opposite direction of the gradient, scaled by the step size (otherwise known as the learning rate $\alpha$): $\theta := \theta - \alpha \nabla J(\theta)$.

Because the gradient points to where the function is increasing, going in the opposite direction leads us toward the minimum of the function. In this manner, we repeatedly update the model's coefficients until we reach the minimum of the error function and obtain a sigmoid curve that fits the data well.
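The update rule described above can be sketched in a few lines of NumPy. This is an illustrative implementation, not a production solver: the toy "hours studied vs. pass/fail" data, the learning rate, and the iteration count are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(theta, X, y):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    p = np.clip(sigmoid(X @ theta), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def fit_logistic_gd(X, y, learning_rate=0.1, n_iters=2000):
    """Batch gradient descent on the log loss, starting from all-zero parameters."""
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(n_iters):
        p = sigmoid(X @ theta)
        gradient = X.T @ (p - y) / len(y)   # partial derivatives of the log loss
        theta -= learning_rate * gradient   # step against the gradient
        history.append(log_loss(theta, X, y))
    return theta, history

# Made-up toy data: hours studied and whether the student passed.
hours = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0])
passed = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1], dtype=float)
X = np.column_stack([np.ones_like(hours), hours])  # first column is the intercept

theta, history = fit_logistic_gd(X, passed)
print("intercept, slope:", theta)
print(f"log loss after first step: {history[0]:.3f}, after last step: {history[-1]:.3f}")
```

The recorded loss history should fall over the iterations, which is a quick way to check that the learning rate is not too large.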



## Q3. Explain the concept of regularization in logistic regression and how it helps prevent overfitting.


<b><u>Regularization in Logistic Regression</u></b>:

Regularization is a technique used to avoid overfitting in machine learning models. It does this by adding a penalty term to the objective function (also called the loss function or error function) that the model is trying to minimize.

The objective function measures the error or difference between the predicted output of the model and the true output. In logistic regression, the objective function is typically the cross-entropy loss, which measures the difference between the predicted probability of the positive class and the true label (1 or 0).

By adding a penalty term to the objective function, regularization helps to reduce the complexity of the model and prevent it from fitting the training data too closely. The strength of the penalty is controlled by a hyperparameter (often written $\lambda$). A higher value leads to stronger regularization and a simpler model, while a lower value allows the model to be more complex.


<b><u>How Regularization Helps Prevent Overfitting</u></b>:

* Reducing the complexity of the model by forcing the coefficients to be small

  * By adding a penalty term to the objective function, regularization forces the coefficients (weights) of the model to be small. This reduces the complexity of the model and makes it less prone to overfitting.

  * For example, if you have a model with a large number of features (and therefore a large number of coefficients), an L1 penalty can reduce the number of non-zero coefficients and simplify the model.

* Improving generalization by reducing the variance of the model

  * Regularization can also help to reduce the variance of the model, which is how much the model’s predictions change when it is trained on a different sample of the data.

  * A model with high variance is sensitive to small changes in the training data and is more likely to overfit. By reducing the variance of the model, regularization can improve the generalization of the model to new, unseen data.
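To make the shrinking effect concrete, here is a small sketch that fits logistic regression by gradient descent with an L2 penalty $\frac{\lambda}{2}\lVert w \rVert^2$ added to the log loss. The synthetic data, the three values of `lam`, and the choice to leave the intercept unpenalized are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_l2_logistic(X, y, lam, learning_rate=0.1, n_iters=3000):
    """Gradient descent on log loss + (lam / 2) * ||w||^2; the intercept b is not penalized."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iters):
        p = sigmoid(X @ w + b)
        grad_w = X.T @ (p - y) / len(y) + lam * w   # penalty adds lam * w to the gradient
        grad_b = np.mean(p - y)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Synthetic data: only the first two of five features carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=100) > 0).astype(float)

norms = {}
for lam in (0.0, 0.1, 1.0):
    w, _ = fit_l2_logistic(X, y, lam)
    norms[lam] = float(np.linalg.norm(w))
    print(f"lambda={lam}: ||w|| = {norms[lam]:.3f}, w = {np.round(w, 3)}")
```

A larger `lam` pulls the weight vector closer to zero, which is the "simpler model" described above.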


More info: https://medium.com/@rithpansanga/logistic-regression-and-regularization-avoiding-overfitting-and-improving-generalization-e9afdcddd09d

## Q4. What is the ROC curve, and how is it used to evaluate the performance of the logistic regression model?


<b><u>What is ROC Curve</u></b>:

ROC curves were first employed during World War 2 to analyze radar signals: after missing the Japanese aircraft that carried out the attack on Pearl Harbor, the US wanted their radar receiver operators to better identify aircraft from signal noise (e.g. clouds). The operator's ability to identify as many true positives as possible while minimizing false positives was named the Receiver Operating Characteristic, and the curve analyzing their predictive abilities was called the ROC Curve. Today, ROC curves are used in a number of contexts, including clinical settings (to assess the diagnostic accuracy of a test) and machine learning.

In machine learning, we use ROC Curves to analyze the predictive power of a classifier: they provide a visual way to observe how changes in our model’s classification thresholds affect our model’s performance. Similar to their original use in the 1940s, the curves allow us to select for classification thresholds that allow our model to identify as many true positives as possible while minimizing false positives.


In particular, the ROC curve is composed by plotting a model's True-Positive Rate (TPR) versus its False-Positive Rate (FPR) across all possible classification thresholds, where:

* True Positive Rate (TPR): The probability that a positive sample is correctly predicted in the positive class. E.g., the percentage of actual airplanes whose radar signals are correctly predicted to be airplanes.

* False Positive Rate (FPR): The probability that a negative sample is incorrectly predicted in the positive class. E.g., the percentage of radar signals that are not airplanes but are incorrectly predicted to be airplanes.

<br>


<b><u>How ROC is used to evaluate</u></b>:


You have probably read about ROC curves for medical diagnostic tests. But how can the ROC curve itself be used as a diagnostic tool for logistic regression (LR) performance?

You use LR because you have a binary response variable, whose observed values are usually coded as 0 or 1. However, LR produces a “predicted” probability of the outcome for each observation, and that probability ranges between 0 and 1. So, each “observed” value (0 or 1) has a corresponding “predicted” value (anywhere from 0 to 1).

An important concept is the “classification cut-off”: the predicted-probability threshold above which the “predicted” response is classified as a success. For example, set the threshold to 0.5. Then, for a predicted probability > 0.5, the “predicted” response is 1, although the actual observed response is not necessarily 1 (it may be 0). The million-dollar question is how to pick the right threshold.

An ROC curve has two axes: a horizontal one for 1 - specificity (the false positive rate) and a vertical one for sensitivity (the true positive rate).

Mathematically:

Sensitivity = (number correctly predicted 1s)/(total number observed 1s)

Specificity = (number correctly predicted 0s)/(total number observed 0s)

What the ROC curve does is screen each possible cut-off value that changes the classification (0 or 1) and plot it as a dot. The dot is located at the sensitivity for that cut-off on the Y axis and at 1 - specificity for that cut-off on the X axis. Note that, depending on the software you use, the X axis may show specificity instead, in which case its values are arranged in descending order.

You may aim for high sensitivity (true positives), but this may come at the expense of specificity (true negatives). Having both very high sensitivity and very high specificity is rarely realistic: it would mean the classes are almost perfectly separated, in which case you hardly need a model to predict the response. Depending on your case, you may need to pick a cut-off with high sensitivity if you can’t take the risk of accepting false negatives. Similarly, if you can’t afford false positives, you should consider a cut-off with high specificity.

The area under the curve (AUC) is a summary statistic that ranges from 0 to 1, where 0.5 corresponds to random guessing and 1 to perfect separation, so useful models fall between 0.5 and 1. AUC indicates how well the LR model ranks positive outcomes above negative ones. The larger the AUC, the better the LR model is. For example, an AUC of 0.89 means that a randomly chosen positive observation receives a higher predicted probability than a randomly chosen negative observation 89% of the time, compared with 50% for flipping a coin. Statistical tests can determine whether this difference is significant.

To sum up, the ROC curve in logistic regression plays two roles: first, it helps you pick the optimal cut-off point for predicting success (1) or failure (0). Second, the AUC of the ROC curve is a useful indicator of model performance. By the way, this is how (and why) logistic regression can be used as a “classification” tool.
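The cut-off sweep described above can be sketched in plain Python. The four labels and scores are a made-up toy example, and the helper names `roc_points` and `auc` are hypothetical (in practice a library routine would normally be used); a score equal to the cut-off is treated as positive here.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per candidate cut-off, from strictest to loosest."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # cut-off above every score: nothing is predicted positive
    for cutoff in sorted(set(scores), reverse=True):
        predicted = [int(s >= cutoff) for s in scores]
        tp = sum(1 for p, y in zip(predicted, y_true) if p == 1 and y == 1)
        fp = sum(1 for p, y in zip(predicted, y_true) if p == 1 and y == 0)
        points.append((fp / n_neg, tp / n_pos))  # (1 - specificity, sensitivity)
    return points

def auc(points):
    """Area under the ROC points using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Toy example: observed classes and predicted probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
print(points)
print(auc(points))  # 0.75: three of the four positive/negative pairs are ranked correctly
```

Each point corresponds to one cut-off, so choosing a threshold amounts to choosing a point on this curve.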


<br>

Source: https://mlu-explain.github.io/roc-auc/

Source: https://www.linkedin.com/pulse/roc-curve-logistic-regression-hossam-mohamed-b-pharm-mph

## Q5. What are some common techniques for feature selection in logistic regression? How do these techniques help improve the model's performance?


Just as with a linear regression model, we can employ feature selection methods that help us do one of the following:

* Efficiently determine which explanatory variables should be included in (or left out of) the model. As with linear regression models, these include the following types of techniques:

  * Backward Elimination Algorithms

  * Forward Selection Algorithms

  * Model Regularization

* Represent the full feature matrix $ X $ (i.e. all of the possible $ p $ explanatory variables) in a different way that still preserves important aspects of $X$. As with a linear regression model, this includes the following type of technique:

  * Principal Component Regression
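As one illustration of the techniques listed above, the sketch below implements a simple greedy forward selection: starting from an intercept-only model, it repeatedly adds the feature that lowers the training log loss the most and stops when the gain falls below a threshold. The helper names, the `min_gain` stopping rule, and the synthetic data are assumptions for the example; a real workflow would usually score candidates on held-out data or with an information criterion.

```python
import numpy as np

def fit_log_loss(X, y, learning_rate=0.5, n_iters=300):
    """Fit logistic regression (with intercept) by gradient descent; return the final log loss."""
    Xb = np.column_stack([np.ones(len(y)), X])
    theta = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-(Xb @ theta)))
        theta -= learning_rate * Xb.T @ (p - y) / len(y)
    p = np.clip(1.0 / (1.0 + np.exp(-(Xb @ theta))), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def forward_selection(X, y, min_gain=0.01):
    """Greedily add the feature that reduces the log loss most; stop when the gain is small."""
    selected, remaining = [], list(range(X.shape[1]))
    pos_rate = y.mean()
    # Log loss of the intercept-only model, which predicts the base rate for everyone.
    best_loss = -(pos_rate * np.log(pos_rate) + (1 - pos_rate) * np.log(1 - pos_rate))
    while remaining:
        losses = {j: fit_log_loss(X[:, selected + [j]], y) for j in remaining}
        j_best = min(losses, key=losses.get)
        if best_loss - losses[j_best] < min_gain:
            break
        selected.append(j_best)
        remaining.remove(j_best)
        best_loss = losses[j_best]
    return selected

# Synthetic data: only the feature at column index 2 drives the outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (2.0 * X[:, 2] + rng.normal(scale=0.5, size=200) > 0).astype(float)

selected = forward_selection(X, y)
print("selected feature indices, in order of selection:", selected)
```

Backward elimination follows the same pattern in reverse: start with all features and repeatedly drop the one whose removal hurts the score the least.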

<b><u>How Feature Selection Improves Model's Performance</u></b>:

Feature selection aims to obtain a good feature subset with as few features as possible from the original feature space, which can achieve better classification performance [1]. Feature selection can reduce data redundancy to avoid the loss of precision and the waste of computing resources caused by excessively high dimensions [2]. More importantly, because the selected features are original features rather than transformed ones, their physical meaning is retained, so the result can guide which features to collect in the first place. The technology has been applied to computer vision [3], medical [4], remote sensing [5] and other fields [6], [7].


Source: https://www.sciencedirect.com/science/article/abs/pii/S0925231223003910#:~:text=Feature%20selection%20can%20help%20to,square%20loss%20and%20hinge%20loss.

## Q6. How can you handle imbalanced datasets in logistic regression? What are some strategies for dealing with class imbalance?


<b><u>How weighted Logistic Regression is used for an Imbalanced Dataset?</u></b>

Weighted logistic regression is a technique commonly employed to address the issue of imbalanced datasets in logistic regression models. In imbalanced datasets, where the classes of interest are not equally represented, traditional logistic regression models may exhibit bias towards the majority class, leading to suboptimal performance, especially for predicting rare events.

Here's how weighted logistic regression works and how it can be used to handle imbalanced datasets:

1. <b>Understanding Imbalanced Datasets</b>: In imbalanced datasets, one class (majority class) is significantly more prevalent than the other class(es) (minority class). For instance, in a medical dataset, the number of healthy patients might outnumber the number of patients with a rare disease by a large margin.

2. <b>The Problem with Traditional Logistic Regression</b>: Traditional logistic regression gives every observation the same weight during model training, so the loss is dominated by the majority class. Consequently, when faced with imbalanced datasets, the model tends to be biased towards the majority class. As a result, it may have lower sensitivity (true positive rate) for the minority class, leading to poor performance in predicting rare events.
3. <b>Weighted Logistic Regression</b>: Weighted logistic regression addresses this issue by assigning different weights to each class based on their prevalence in the dataset. The weights are incorporated into the loss function during model training. By assigning higher weights to the minority class and lower weights to the majority class, the model is encouraged to pay more attention to the minority class, thereby reducing the bias towards the majority class.
4. <b>Training the Weighted Logistic Regression Model</b>: During model training, the weighted logistic regression algorithm adjusts the model parameters to minimize the weighted sum of errors, where errors from the minority class are given higher weights. This encourages the model to focus on correctly classifying instances from the minority class, improving its ability to predict rare events.
5. <b>Evaluation and Fine-Tuning</b>: After training, the weighted logistic regression model is evaluated using appropriate performance metrics, such as precision, recall, F1-score, and area under the ROC curve (AUC-ROC), considering the imbalanced nature of the dataset. Depending on the evaluation results, the model may be fine-tuned further by adjusting the class weights or other hyperparameters to achieve better performance.
6. <b>Applications</b>: Weighted logistic regression is widely used in various domains, including healthcare, finance, fraud detection, and anomaly detection, where correctly identifying rare events or minority classes is crucial.
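The weighting idea in the steps above can be sketched as follows. The class-weight formula `n_samples / (n_classes * class_count)` is one common "balanced" heuristic (it is the one scikit-learn uses for `class_weight="balanced"`); the 90/10 label split and the constant predicted probability are made-up values for illustration.

```python
import math
from collections import Counter

def balanced_class_weights(labels):
    """weight_c = n_samples / (n_classes * count_c), so rarer classes get larger weights."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * count) for c, count in counts.items()}

def weighted_log_loss(y_true, probs, weights):
    """Log loss where each observation's error is scaled by the weight of its true class."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, 1e-12), 1 - 1e-12)  # avoid log(0)
        total += -weights[y] * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Made-up imbalanced labels: 90 negatives, 10 positives.
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)  # the minority class (1) gets weight 5.0

# A lazy model that predicts a 10% positive probability for everyone.
probs = [0.1] * len(labels)
unweighted = weighted_log_loss(labels, probs, {0: 1.0, 1: 1.0})
weighted = weighted_log_loss(labels, probs, weights)
print(f"unweighted loss: {unweighted:.3f}, weighted loss: {weighted:.3f}")
```

Under the weighted loss, ignoring the minority class becomes much more costly, which is what pushes the optimizer to fit it better.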


For more information: https://www.geeksforgeeks.org/weighted-logistic-regression-for-imbalanced-dataset/

<br>

<b><u>What are the best ways to handle class imbalance in logistic regression?</u></b>:

1. <b>Data Sampling</b>:
One way to deal with class imbalance is to adjust the distribution of the data by sampling techniques. There are two main types of sampling: oversampling and undersampling. Oversampling involves adding more copies of minority-class observations to the data, while undersampling involves removing some observations from the majority class. Both methods aim to create a more balanced training set, which deliberately no longer reflects the original proportions of the classes. However, oversampling may cause overfitting, while undersampling may lose valuable information. Therefore, you should use sampling with caution and test different ratios to find the optimal balance.

2. <b>Weighted Loss Function</b>:
Another way to handle class imbalance is to modify the loss function of the logistic regression model. The loss function measures how well the model fits the data and guides the optimization process. By default, the loss function assigns equal weights to all observations, regardless of their class. This means that the model is more influenced by the majority class and may ignore the minority class. To overcome this, you can use a weighted loss function that assigns higher weights to the minority class and lower weights to the majority class. This way, the model will pay more attention to the minority class and reduce the bias.

3. <b>Synthetic Data Generation</b>:
A third way to deal with class imbalance is to generate synthetic data for the minority class using algorithms such as SMOTE (Synthetic Minority Oversampling Technique). SMOTE creates new observations for the minority class by interpolating between existing ones. This helps to increase the diversity and size of the minority class, without duplicating or discarding any data. However, synthetic data may not capture the true characteristics and variability of the minority class, and may introduce noise or outliers. Therefore, you should use synthetic data generation with care and validate the results.

4. <b>Model Evaluation Metrics</b>:
A final way to handle class imbalance is to choose appropriate metrics to evaluate the performance of the logistic regression model. A common default metric for classifiers is accuracy, which measures the percentage of correct predictions. However, accuracy can be misleading when dealing with class imbalance, as it may favor the majority class and ignore the minority class. For example, if 90% of the data belongs to class A and 10% to class B, a model that always predicts class A will have 90% accuracy, but zero sensitivity for class B. Therefore, you should use other metrics that take into account the balance and quality of the predictions, such as precision, recall, F1-score, ROC curve, and AUC.
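The 90%/10% example above can be checked with a few lines of Python. The labels are made up to match that example (class B is coded as 1), and returning 0.0 when a ratio has a zero denominator is a convention chosen for this sketch.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) with class 1 treated as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def safe_ratio(numerator, denominator):
    """Return 0.0 instead of failing when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

# 90 observations of class A (0) and 10 of class B (1); the model always predicts A.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
recall = safe_ratio(tp, tp + fn)      # sensitivity for class B
precision = safe_ratio(tp, tp + fp)   # no positive predictions were made at all

print(f"accuracy={accuracy}, recall={recall}, precision={precision}")
```

Accuracy looks strong while recall for the minority class is zero, which is why accuracy alone is not enough on imbalanced data.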

Source: https://www.linkedin.com/advice/3/what-best-ways-handle-class-imbalance-logistic-regression-yankf#:~:text=Noor%20Mahammad-,The%20best%20ways%20to%20handle%20class%20imbalance%20in%20logistic%20regression,class%20weights%20to%20penalize%20misclassifications.

## Q7. Can you discuss some common issues and challenges that may arise when implementing logistic regression, and how they can be addressed? For example, what can be done if there is multicollinearity among the independent variables?


<b><u>Challenges in Logistic Regression</u></b>:

Logistic regression faces challenges such as multicollinearity, overfitting, and the assumption of a linear relationship between the predictors and the log-odds of the outcome. These issues can lead to unstable coefficient estimates and difficulty generalizing the model to new data. Additionally, the linearity assumption may not hold in practice.

Common Challenges Faced in Logistic Regression:

<u>Imbalanced datasets</u>:
Imbalanced datasets lead to biased predictions towards the majority class and result in inaccurate evaluations for the minority class. This disparity in class representation hampers the model's ability to properly account for the less-represented group, affecting its overall predictive performance.

<u>Multicollinearity</u>:
Multicollinearity arises from highly correlated predictor variables, making it difficult to determine the individual effects of each variable on the outcome. The strong interdependence among predictors further complicates the modeling process, impacting the reliability of the logistic regression analysis. 

  * Multicollinearity reduces the precision of the estimated coefficients, which weakens the statistical power of your regression model. You might be unable to trust the p-values to identify statistically significant independent variables.

<u>Overfitting</u>:
Overfitting occurs when the model becomes overly complex and starts fitting noise in the data rather than capturing the underlying patterns. This complexity reduces the model's ability to generalize well to new data, resulting in a decrease in overall performance.


<b><u>How to Resolve Multicollinearity in Logistic Regression</u></b>:


In Python, there are several ways to detect multicollinearity in a dataset, such as using the Variance Inflation Factor (VIF) or calculating the correlation matrix of the independent variables. To address multicollinearity, techniques such as regularization or feature selection can be applied to select a subset of independent variables that are not highly correlated with each other.

  * VIF measures how strongly an independent variable is correlated with the other independent variables. It is computed by taking a variable and regressing it against every other independent variable.

  * Equivalently, the VIF score of an independent variable represents how well the variable is explained by the other independent variables.

  The $R^2$ value of that regression shows how well an independent variable is described by the other independent variables. A high value of $R^2$ means that the variable is highly correlated with the other variables. This is captured by the VIF, $\mathrm{VIF} = \frac{1}{1 - R^2}$, which is shown below:

  ![image.png](attachment:image.png)

  So, the closer the $R^2$ value is to 1, the higher the value of VIF and the higher the multicollinearity with the particular independent variable.

  * VIF starts at 1 and has no upper limit
  * VIF = 1, no correlation between the independent variable and the other variables
  * VIF exceeding 5 or 10 indicates high multicollinearity between this independent variable and the others
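  A minimal NumPy sketch of the calculation, using $\mathrm{VIF} = 1 / (1 - R^2)$ with $R^2$ taken from regressing each variable on all the others. The synthetic columns (`age`, `years_of_service`, and an unrelated `salary_band`) are made-up data for illustration, not the dataset shown in this answer, and the function name `vif` is hypothetical.

```python
import numpy as np

def vif(X):
    """Return VIF_j = 1 / (1 - R_j^2) for each column j, regressing it on the other columns."""
    n_rows, n_cols = X.shape
    values = []
    for j in range(n_cols):
        target = X[:, j]
        # Regress column j on an intercept plus every other column.
        others = np.column_stack([np.ones(n_rows), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        values.append(1.0 / (1.0 - r_squared))
    return np.array(values)

# Synthetic data: years of service is almost a shifted copy of age.
rng = np.random.default_rng(0)
age = rng.normal(40, 8, size=300)
years_of_service = age - 22 + rng.normal(0, 1.5, size=300)
salary_band = rng.normal(5, 2, size=300)  # unrelated to the other two

X = np.column_stack([age, years_of_service, salary_band])
for name, value in zip(["age", "years_of_service", "salary_band"], vif(X)):
    print(f"{name}: VIF = {value:.2f}")
```

  The two nearly redundant columns should show VIF values well above the 5 to 10 range, while the unrelated column stays close to 1.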

  Example:

  ![image-2.png](attachment:image-2.png)

  We can see here that the ‘Age’ and ‘Years of service’ have a high VIF value, meaning they can be predicted by other independent variables in the dataset.

  Dropping one of the correlated features will help in bringing down the multicollinearity between correlated features:

  ![image-3.png](attachment:image-3.png)

  The image on the left contains the original VIF values for the variables, and the one on the right is after dropping the ‘Age’ variable. We were able to drop the variable ‘Age’ from the dataset because its information was being captured by the ‘Years of service’ variable. This has reduced the redundancy in our dataset. Dropping variables should be an iterative process, starting with the variable that has the largest VIF value, because its trend is largely captured by the other variables. If you do this, you will notice that the VIF values for the other variables are reduced too, although to a varying extent.

  In our example, after dropping the ‘Age’ variable, VIF values for all variables have decreased to varying degrees.

  ![image-4.png](attachment:image-4.png)

  Alternatively, correlated variables can be combined instead of dropped. The image on the left contains the original VIF values for the variables, and the one on the right is after combining the ‘Age’ and ‘Years of service’ variables. Combining ‘Age’ and ‘Years of service’ into a single variable, ‘Age_at_joining’, allows us to capture the information in both variables.

However, multicollinearity is not always a problem. Whether it needs fixing depends primarily on the following:

* If you care more about how each individual feature affects the target variable than about how a group of features does, then removing multicollinearity may be a good option.
* If multicollinearity is not present in the features you are interested in, then it may not be a problem.

<b><u>How to Interpret Multicollinearity in SPSS</u></b>:

To interpret multicollinearity in SPSS, check the following diagnostics:

* VIF (Variance Inflation Factor): found in the Coefficients table. A problem if VIF > 5-10. Higher VIF = more multicollinearity.
* Tolerance: found in the Coefficients table. A problem if < 0.1. Lower tolerance = more multicollinearity.
* Condition Index: found in the Collinearity Diagnostics table. Caution if 15-30; a problem if > 30.
* Variance Proportions: found in the Collinearity Diagnostics table. A problem if multiple variables are > 0.5 on the same row.
* Correlation Matrix: a problem if correlations between variables are > 0.8.
<b><u>Conclusion</u></b>:

Multicollinearity can occur in regression models when two or more independent variables in a data frame have a high correlation with one another. Its presence can cause the regression coefficients to become unstable and difficult to interpret, which can lead to wide confidence intervals and increased variability in the predicted values of the dependent variable. Understanding what causes it and how to detect and fix it can help us to overcome these problems.

The Variance Inflation Factor (VIF) can be used to detect the existence of multicollinearity in a dataset, and the problem can be fixed by identifying and dropping the correlated variables. Remember, when assessing the statistical significance of predictor variables in a regression model, it is important to consider their individual coefficients and their standard errors, p-values, and confidence intervals. Predictor variables with high multicollinearity may have inflated standard errors and p-values, which can lead to incorrect conclusions about their statistical significance.


Key Takeaways:
* Multicollinearity occurs when two or more independent variables have a high correlation with one another in a regression model, which makes it difficult to determine the individual effect of each independent variable on the dependent variable.
* Multicollinearity can occur due to poorly designed experiments, highly observational data, creating new variables that are dependent on other variables, including identical variables in the dataset, inaccurate use of dummy variables, or insufficient data.
* One method to detect multicollinearity is to calculate the variance inflation factor (VIF) for each independent variable; a VIF value exceeding 5 or 10 indicates high multicollinearity.
* To fix multicollinearity, one can remove one of the highly correlated variables, combine them into a single variable, or use a dimensionality reduction technique such as principal component analysis to reduce the number of variables while retaining most of the information.