# Assignment Answers

# 1.

##### Part-1:<br><br>
- Linear regression is used to predict continuous numerical values, such as predicting the price of a house based on its features like size, number of rooms, location, etc. The model tries to fit a line or hyperplane through the data points to minimize the sum of the squared errors between the predicted and actual values.

- On the other hand, logistic regression is used to predict binary categorical outcomes, such as whether a customer will buy a product or not, based on certain features like age, gender, income, etc. The model tries to fit a logistic function (S-shaped curve) through the data points to estimate the probability of the outcome being true or false.
<br><br>
##### Part-2:<br><br>
Logistic regression is more appropriate than linear regression when predicting whether a patient has a certain disease based on their medical history, age, sex, and other relevant features. The outcome of interest is binary (the patient either has the disease or does not), and we want to estimate the probability of disease given the available features. Logistic regression models that probability directly and keeps it between 0 and 1. Linear regression is a poor fit here because its predictions are unbounded: they can fall below 0 or above 1, so they cannot be read as probabilities.
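
To make the contrast concrete, here is a minimal sketch comparing a raw linear score with the same score passed through the logistic (sigmoid) function. The intercept and slope are hypothetical values chosen only for illustration, not fitted to any data.

```python
import math

def sigmoid(z):
    """Map any real-valued score z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for a single feature (illustration only).
intercept, slope = -1.0, 2.0

for x in (-3.0, 0.0, 3.0):
    linear_score = intercept + slope * x   # unbounded: may be below 0 or above 1
    probability = sigmoid(linear_score)    # always a valid probability
    print(f"x={x:+.1f}  linear={linear_score:+.1f}  probability={probability:.3f}")
```

The linear score ranges from -7 to +5 on these inputs, while the sigmoid output stays inside (0, 1) and can be thresholded to produce a class label.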

# 2.

- The cost function used in logistic regression is the binary cross-entropy or log loss function, which measures the difference between the predicted probability and the actual binary class label of each data point. 
- The formula for the cost function is as follows:
<br><br>
J(θ) = -(1/m) * ∑ [y*log(h(x)) + (1-y)*log(1-h(x))]
<br>
where:
<br><br>
J(θ) is the cost function<br>
θ are the model parameters<br>
m is the number of training examples<br>
∑ is the sum over all m training examples<br>
y is the actual binary class label (0 or 1) of a training example<br>
h(x) is the predicted probability of the positive class (i.e., y=1) for that example<br><br>
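
As a rough illustration of the formula, the sketch below computes the cost for two sets of hypothetical predictions. The clipping constant `eps` is an assumption added to avoid taking the log of zero. It is not part of the formula itself.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Average binary cross-entropy over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip so log() never receives 0
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

y_true = [1, 0, 1, 0]
mostly_right = [0.9, 0.1, 0.8, 0.2]   # hypothetical predictions close to the labels
mostly_wrong = [0.1, 0.9, 0.2, 0.8]   # hypothetical predictions far from the labels

print(log_loss(y_true, mostly_right))
print(log_loss(y_true, mostly_wrong))
```

Predictions that are confident and wrong are penalized much more heavily than predictions that are confident and right, which is what pushes the model toward well-calibrated probabilities.
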
- The goal of logistic regression is to minimize the cost function by finding the optimal values of the model parameters. 
- This is typically done with gradient descent or one of its variants: at each iteration, the gradient of the cost function with respect to the model parameters is computed, and the parameters are moved a small step in the opposite direction of the gradient, which lowers the cost.
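
The update loop described above can be sketched in plain Python as follows. The toy dataset, learning rate, and step count are assumptions chosen for illustration. The gradient used is the standard one for this cost: the average of (h(x) - y) times each feature value.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, row):
    return sigmoid(sum(t * v for t, v in zip(theta, row)))

def cost(theta, X, y):
    """Binary cross-entropy averaged over the training examples."""
    total = 0.0
    for row, label in zip(X, y):
        h = predict(theta, row)
        total += label * math.log(h) + (1 - label) * math.log(1.0 - h)
    return -total / len(y)

def gradient(theta, X, y):
    """Partial derivatives of the cost with respect to each parameter."""
    grad = [0.0] * len(theta)
    for row, label in zip(X, y):
        error = predict(theta, row) - label
        for j, value in enumerate(row):
            grad[j] += error * value / len(y)
    return grad

def gradient_descent(X, y, learning_rate=0.1, steps=500):
    theta = [0.0] * len(X[0])
    for _ in range(steps):
        grad = gradient(theta, X, y)
        # Step against the gradient to reduce the cost.
        theta = [t - learning_rate * g for t, g in zip(theta, grad)]
    return theta

# Toy data: the first column is a constant 1 that acts as the intercept term.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.0], [1.0, 3.0], [1.0, 3.5], [1.0, 4.5]]
y = [0, 0, 1, 0, 1, 1]

theta = gradient_descent(X, y)
print("initial cost:", cost([0.0, 0.0], X, y))
print("final cost:  ", cost(theta, X, y))
```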

- The optimization algorithm seeks the values of the model parameters that maximize the likelihood of observing the training data, given the model. 
- Maximizing the likelihood is equivalent to minimizing the negative log-likelihood, which, once averaged over the m training examples, is exactly the binary cross-entropy cost function.
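
A small numeric check of this equivalence, using hypothetical labels and predicted probabilities:

```python
import math

y_true = [1, 0, 1, 1]
y_prob = [0.8, 0.3, 0.6, 0.9]   # hypothetical predicted P(y=1)

# Likelihood: product of the probability the model gave to each observed label.
likelihood = 1.0
for y, p in zip(y_true, y_prob):
    likelihood *= p if y == 1 else (1.0 - p)

mean_negative_log_likelihood = -math.log(likelihood) / len(y_true)

cross_entropy = -sum(
    y * math.log(p) + (1 - y) * math.log(1.0 - p)
    for y, p in zip(y_true, y_prob)
) / len(y_true)

print(mean_negative_log_likelihood, cross_entropy)
```

The two printed values agree up to floating-point rounding, so minimizing one minimizes the other.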

# 3.

- In logistic regression, regularization is a technique used to prevent overfitting of the model. Overfitting occurs when the model is too complex and fits the training data too well, resulting in poor performance on new data.

- The two most common regularization techniques used in logistic regression are L1 regularization and L2 regularization.

1. L1 regularization, also known as Lasso regularization, adds a penalty term to the cost function that is proportional to the sum of the absolute values of the model coefficients. This technique encourages the model to set some coefficients to exactly zero, effectively performing feature selection and reducing the number of features used in the model.

2. L2 regularization, also known as Ridge regularization, adds a penalty term to the cost function that is proportional to the sum of the squared model coefficients. This technique shrinks the coefficients towards zero but does not set them exactly to zero, so all features are retained in the model but with reduced weights.
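
The two penalty terms can be written out directly. In this sketch `lam` (the regularization strength) and the coefficient values are hypothetical, and the intercept is assumed to be excluded from the penalty, as is common practice.

```python
def l1_penalty(coefficients, lam):
    """Lasso penalty: lam times the sum of absolute coefficient values."""
    return lam * sum(abs(w) for w in coefficients)

def l2_penalty(coefficients, lam):
    """Ridge penalty: lam times the sum of squared coefficient values."""
    return lam * sum(w * w for w in coefficients)

weights = [0.5, -2.0, 0.0, 3.0]   # hypothetical coefficients, intercept excluded
lam = 0.1                          # hypothetical regularization strength

print("L1 penalty:", l1_penalty(weights, lam))
print("L2 penalty:", l2_penalty(weights, lam))
```

Because the L2 penalty squares each coefficient, it punishes the large coefficient (3.0) far more than the small one (0.5), whereas the L1 penalty grows at the same rate for every coefficient.
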
<br><br>
Regularization helps prevent overfitting by reducing the complexity of the model, which in turn reduces the variance of the model. By reducing the variance, the model becomes less sensitive to small fluctuations in the training data, and it performs better on new data.

The amount of regularization applied to the model is controlled by a regularization parameter, which determines the trade-off between the fit to the training data and the complexity of the model. A larger regularization parameter results in a simpler model with smaller coefficients and less overfitting, but with reduced accuracy on the training data. Conversely, a smaller regularization parameter results in a more complex model with larger coefficients and better accuracy on the training data, but with increased risk of overfitting. The optimal value of the regularization parameter can be found using techniques such as cross-validation.
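
The effect of the regularization parameter can be seen in a small experiment. The sketch below fits a one-feature model by gradient descent with an L2 penalty, assuming the penalized cost is the average log loss plus `lam` times the sum of squared weights. That scaling is one convention among several, and libraries often parameterize it differently (scikit-learn's `LogisticRegression`, for instance, uses `C`, the inverse of the regularization strength). The data and hyperparameters are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_l2(X, y, lam, learning_rate=0.1, steps=2000):
    """Gradient descent on log loss + lam * sum(w^2); the intercept is not penalized."""
    m, n = len(y), len(X[0])
    intercept, weights = 0.0, [0.0] * n
    for _ in range(steps):
        errors = [
            sigmoid(intercept + sum(w * v for w, v in zip(weights, row))) - label
            for row, label in zip(X, y)
        ]
        intercept -= learning_rate * sum(errors) / m
        weights = [
            w - learning_rate * (
                sum(e * row[j] for e, row in zip(errors, X)) / m + 2.0 * lam * w
            )
            for j, w in enumerate(weights)
        ]
    return intercept, weights

# Toy one-feature dataset with overlapping classes.
X = [[-1.5], [-1.0], [-0.5], [0.5], [1.0], [1.5]]
y = [0, 0, 1, 0, 1, 1]

fitted = {lam: fit_l2(X, y, lam)[1][0] for lam in (0.0, 0.1, 1.0)}
for lam, weight in fitted.items():
    print(f"lam={lam:<4} weight={weight:.3f}")
```

As `lam` grows, the fitted weight moves closer to zero, which is the "simpler model with smaller coefficients" described above.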

# 4.

##### Part-1:<br><br>
- The ROC (Receiver Operating Characteristic) curve is a graphical representation of the performance of a binary classification model, including logistic regression. 
- It is a plot of the true positive rate (sensitivity) against the false positive rate (1-specificity) for different threshold values.

- To create an ROC curve for a logistic regression model, we first calculate the predicted probabilities of the positive class (1) for each observation in the test dataset. 
- We then vary the classification threshold from 0 to 1 and calculate the true positive rate and false positive rate for each threshold. 
- These values are plotted on a graph with the false positive rate on the x-axis and the true positive rate on the y-axis.

- The closer the ROC curve is to the upper left corner, the better the model performs. An area under the curve (AUC) of 1 indicates a perfect model, while an AUC of 0.5 indicates a model that does no better than random guessing. 
- The AUC value provides an overall measure of model performance, with a higher value indicating a better model.
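
The procedure above can be sketched without any libraries. The labels and scores below are hypothetical, and the thresholds are taken to be the distinct predicted scores, which is enough to trace every corner of the curve. In practice a library routine such as scikit-learn's `roc_curve` would normally be used instead.

```python
def roc_points(y_true, y_score):
    """Return (false positive rate, true positive rate) pairs, strictest threshold first."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(y_score), reverse=True):
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]   # hypothetical predicted probabilities

points = roc_points(y_true, y_score)
print(points)
print("AUC:", auc(points))
```
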
<br><br>
##### Part-2:<br><br>
- The ROC curve is a useful tool for evaluating the trade-off between sensitivity and specificity, which is important in many classification problems, such as medical diagnosis or credit risk assessment. 
- By adjusting the threshold value, we can control the balance between these two metrics and choose the threshold that best suits our application.
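
A minimal sketch of this trade-off, using made-up labels and predicted probabilities:

```python
def sensitivity_specificity(y_true, y_prob, threshold):
    """Classify as positive when probability >= threshold, then score both classes."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= threshold)
    fn = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p < threshold)
    tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < threshold)
    fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_prob = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1, 0.55]   # hypothetical model outputs

for threshold in (0.3, 0.5, 0.7):
    sens, spec = sensitivity_specificity(y_true, y_prob, threshold)
    print(f"threshold={threshold}  sensitivity={sens:.2f}  specificity={spec:.2f}")
```

On this data, raising the threshold from 0.3 to 0.7 lowers sensitivity and raises specificity. A screening test for a serious disease might favor the low threshold, while an application where false alarms are costly might favor the high one.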

# 5.

##### Part-1:<br><br>
There are several common techniques for feature selection in logistic regression, including:

1. Forward selection: 
- This approach starts with no features (or a single feature) and adds features to the model one at a time, at each step choosing the candidate that most improves the model's performance. 
- The process stops when adding another feature no longer improves performance meaningfully, or when the desired level of performance is reached.

2. Backward elimination: 
- This approach starts with all features in the model and removes one feature at a time, at each step dropping the feature whose removal hurts performance the least. 
- The process stops when removing any further feature would noticeably degrade performance, or when the desired level of performance is reached.

3. Regularization: 
- Regularization is a technique that adds a penalty term to the cost function to discourage overfitting. 
- L1 regularization (Lasso) can set coefficients exactly to zero, so it selects features as part of fitting. L2 regularization (Ridge) only shrinks coefficients, so it reduces the influence of weak features without removing them.

4. Recursive feature elimination: 
- This approach starts with all features in the model and then iteratively refits the model and removes the feature with the lowest importance score, based on a given metric (e.g., absolute coefficient value, statistical significance, information gain), until the desired number of features is reached.
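
As one illustration, greedy forward selection can be sketched as below. The `score` argument stands in for whatever evaluation is used in practice (for example a cross-validated metric of a logistic regression fitted on the given features). The `toy_score` function and its feature names are hypothetical and exist only so the example runs on its own.

```python
def forward_selection(candidates, score, min_gain=1e-9):
    """Greedily add the feature that most improves score(selected_features)."""
    selected = []
    best_score = score(selected)
    remaining = list(candidates)
    while remaining:
        trial_scores = {f: score(selected + [f]) for f in remaining}
        best_feature = max(trial_scores, key=trial_scores.get)
        if trial_scores[best_feature] - best_score <= min_gain:
            break  # no remaining feature improves the score
        selected.append(best_feature)
        remaining.remove(best_feature)
        best_score = trial_scores[best_feature]
    return selected, best_score

# Hypothetical stand-in for a real validation score: each feature contributes a
# fixed amount, and every feature added costs a small complexity penalty.
usefulness = {"age": 0.10, "income": 0.06, "zip_code": 0.005, "noise": 0.0}

def toy_score(features):
    return 0.5 + sum(usefulness[f] for f in features) - 0.01 * len(features)

selected, final_score = forward_selection(usefulness, toy_score)
print(selected, final_score)
```

Backward elimination follows the same pattern in reverse: start from the full list and repeatedly drop the feature whose removal costs the least.
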
<br><br>
##### Part-2:<br><br>
- These techniques improve the model's performance by removing irrelevant or redundant features, which would otherwise encourage overfitting and hurt generalization. 
- By selecting only the most important features, the model can better capture the underlying relationships between the input features and the target variable.

# 6.

A dataset is imbalanced when one class has far more observations than the other, which is a common problem when fitting logistic regression. Here are some common strategies for handling imbalanced datasets in logistic regression:

1. Resampling techniques: 
- Resampling techniques involve either undersampling the majority class or oversampling the minority class to balance the dataset. 
- Random undersampling can be used to remove some observations from the majority class, while random oversampling can be used to duplicate some observations from the minority class. 
- Synthetic oversampling can also be used to generate new synthetic observations based on the existing minority class.

2. Class weighting: 
- This technique involves assigning different weights to different classes in the logistic regression algorithm. 
- The weight assigned to the minority class is increased to make it more important than the majority class, which helps to improve the sensitivity of the model towards the minority class.

3. Threshold adjustment: 
- The probability threshold for classification can be adjusted to achieve the desired balance between precision and recall (also called sensitivity). Lowering the threshold generally increases recall but may reduce precision, while raising the threshold generally increases precision but may reduce recall.

4. Cost-sensitive learning: 
- This approach involves assigning different misclassification costs to different classes based on their importance. The misclassification cost of the minority class is higher, which forces the model to prioritize the correct classification of the minority class.
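
Two of these strategies, class weighting and random oversampling, can be sketched in a few lines. The weighting rule used here, n_samples / (n_classes * class_count), is an assumption modeled on the common "balanced" heuristic (the one scikit-learn applies for `class_weight="balanced"`). The data is made up.

```python
import math
import random

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = {}
    for label in y:
        counts[label] = counts.get(label, 0) + 1
    return {label: len(y) / (len(counts) * count) for label, count in counts.items()}

def weighted_log_loss(y_true, y_prob, weights):
    """Log loss in which each example's term is scaled by its class weight."""
    total = sum(
        -weights[y] * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
        for y, p in zip(y_true, y_prob)
    )
    return total / len(y_true)

def random_oversample(X, y, minority_label, seed=0):
    """Duplicate randomly chosen minority rows until both classes are the same size."""
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == minority_label]
    deficit = (len(y) - len(minority)) - len(minority)
    extra = [rng.choice(minority) for _ in range(deficit)]
    return X + [X[i] for i in extra], y + [y[i] for i in extra]

y = [0] * 8 + [1] * 2                  # 8 majority examples, 2 minority examples
X = [[float(i)] for i in range(10)]

weights = balanced_class_weights(y)
X_balanced, y_balanced = random_oversample(X, y, minority_label=1)
print(weights)
print(y_balanced.count(0), y_balanced.count(1))
```

With these weights, an error on a minority example costs four times as much as the same error on a majority example. Resampling should be applied to the training split only, so that the evaluation data keeps the real class distribution.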

# 7.

Here are some common issues and challenges that may arise when implementing logistic regression and some ways to address them:

1. Multicollinearity: 
- When two or more independent variables are highly correlated with each other, the logistic regression coefficients become unstable and the results become difficult to interpret. One way to address multicollinearity is to apply a regularization penalty such as Ridge (L2) or Lasso (L1).

2. Outliers: 
- Outliers can have a significant impact on the logistic regression model's coefficients and can distort the fitted decision boundary. 
- One way to address outliers is to remove them when they are data errors, to reduce their influence by transforming or capping the affected variables, or to use a robust estimation technique.

3. Missing Data: 
- Missing data can be a problem in logistic regression, as it can lead to biased estimates and reduced model performance. 
- One way to address missing data is to impute the missing values using a method such as mean imputation or multiple imputation.

4. Nonlinearity: 
- Logistic regression assumes a linear relationship between the independent variables and the log-odds of the dependent variable. 
- If this assumption is not met, it can lead to biased estimates and reduced model performance. One way to address nonlinearity is to use polynomial terms or other nonlinear transformations of the independent variables.

5. Overfitting: 
- Logistic regression can easily overfit the data if the model is too complex or if there are too many independent variables. 
- One way to address overfitting is to apply a regularization penalty such as Ridge (L2) or Lasso (L1), or to use a feature selection method to keep only the most important variables.

6. Class Imbalance: 
- If there is a class imbalance in the data, with one class significantly more common than the other, it can lead to biased estimates and reduced model performance. 
- One way to address class imbalance is to use techniques such as oversampling or undersampling to balance the classes.

7. Model Selection: 
- Logistic regression is often used as part of a larger modeling process, and there may be many different models to choose from. 
- One way to address model selection is to use techniques such as cross-validation to compare the performance of different models and select the best one.
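
Two of the remedies above, mean imputation and cross-validation, are simple enough to sketch directly. The helper names and data are hypothetical, and missing values are assumed to be represented as `None`. Real projects would typically shuffle the data before splitting and compute the imputation mean on the training fold only.

```python
def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, validation
        start += size

ages = [30.0, None, 50.0, 40.0]        # hypothetical feature with one missing value
print(impute_mean(ages))

for train, validation in k_fold_indices(5, 2):
    print("train:", train, "validation:", validation)
```

Each candidate model is fitted on the training indices and scored on the validation indices, and the model with the best average validation score across folds is kept.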