# Logistic Regression 1: Concepts, Evaluation, and Practical Issues
This notebook covers the differences between linear and logistic regression, cost function and optimization, regularization, ROC curve, feature selection, handling imbalanced datasets, and common implementation challenges.

## Q1. Explain the difference between linear regression and logistic regression models. Provide an example of a scenario where logistic regression would be more appropriate.

- **Linear regression** predicts a continuous outcome (e.g., house price) as a linear combination of the input features.
- **Logistic regression** predicts the probability of a binary outcome (e.g., spam vs. not spam) by passing the same kind of linear combination through the logistic (sigmoid) function, which maps any real number into the range (0, 1).

**Example:** Predicting whether a patient has a disease (yes/no) based on medical test results is more appropriate for logistic regression.
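
To make the contrast concrete, here is a minimal sketch of how the sigmoid turns a linear score into a probability. The scores are made-up values for illustration, not the output of a fitted model, and the 0.5 cutoff is the conventional default rather than a requirement.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores (w . x + b) for five inputs; illustrative values only
scores = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])

# Linear regression would report the raw score as its prediction.
# Logistic regression passes the score through the sigmoid to get a probability.
probabilities = sigmoid(scores)

# Thresholding the probability at 0.5 gives a class label
predicted_labels = (probabilities >= 0.5).astype(int)

print('Raw scores:   ', scores)
print('Probabilities:', np.round(probabilities, 3))
print('Labels at 0.5:', predicted_labels)
```

A score of 0 maps to a probability of exactly 0.5, so the decision boundary of the classifier is the set of inputs where the linear score is zero.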

## Q2. What is the cost function used in logistic regression, and how is it optimized?

The cost function is **log loss** (binary cross-entropy):

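$$\text{Cost}(y, p) = -\left[\, y \log(p) + (1 - y) \log(1 - p) \,\right]$$

where y is the true label (0 or 1) and p is the predicted probability of class 1. This is the loss for a single example; the training cost is its average over all examples. The cost is convex in the coefficients and is minimized with iterative optimization algorithms such as gradient descent. scikit-learn's LogisticRegression uses the L-BFGS solver by default.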

In [None]:
# Example: Logistic regression with scikit-learn
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

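# n_informative + n_redundant must not exceed n_features; the defaults (2 + 2) would raise a ValueError here
X, y = make_classification(n_samples=100, n_features=2, n_informative=2, n_redundant=0, n_classes=2, random_state=42)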
logreg = LogisticRegression()
logreg.fit(X, y)
print('Coefficients:', logreg.coef_)
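
The gradient descent idea mentioned above can be sketched by hand in NumPy. This is a teaching sketch under stated assumptions, not what scikit-learn does internally: the dataset is synthetic, and the learning rate and step count are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_log_loss(y_true, p, eps=1e-12):
    """Average binary cross-entropy; eps guards against log(0)."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Tiny synthetic dataset (made up for illustration): one feature, noisy binary label
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 1))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(float)

w = np.zeros(1)        # weight vector
b = 0.0                # intercept
learning_rate = 0.1    # arbitrary choice for this sketch
losses = []

for step in range(200):
    p = sigmoid(X_demo @ w + b)
    losses.append(mean_log_loss(y_demo, p))
    # Gradient of the mean log loss with respect to w and b
    error = p - y_demo
    grad_w = X_demo.T @ error / len(y_demo)
    grad_b = error.mean()
    # Step against the gradient
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f'Initial loss: {losses[0]:.4f}')
print(f'Final loss:   {losses[-1]:.4f}')
```

With all coefficients at zero, every prediction is 0.5, so the initial loss equals log(2). Each update then moves the coefficients in the direction that lowers the average log loss.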

## Q3. Explain the concept of regularization in logistic regression and how it helps prevent overfitting.

Regularization adds a penalty to the cost function to discourage large coefficients, helping to prevent overfitting. Common types:
- **L1 (Lasso):** Encourages sparsity (some coefficients become zero).
- **L2 (Ridge):** Shrinks coefficients but keeps all features.

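In scikit-learn, regularization strength is controlled by the parameter C, which is the inverse of the strength: smaller values of C mean stronger regularization. LogisticRegression applies an L2 penalty with C=1.0 by default.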

In [None]:
# Example: Logistic regression with L1 and L2 regularization
logreg_l1 = LogisticRegression(penalty='l1', solver='liblinear')
logreg_l2 = LogisticRegression(penalty='l2')
logreg_l1.fit(X, y)
logreg_l2.fit(X, y)
print('L1 coefficients:', logreg_l1.coef_)
print('L2 coefficients:', logreg_l2.coef_)
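
To see how C changes the fit, the sketch below refits L1- and L2-penalized models over a few values of C. The dataset shape (10 features, 3 informative) and the grid of C values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumed setup): 10 features, only 3 of them informative
X_reg, y_reg = make_classification(
    n_samples=200, n_features=10, n_informative=3, n_redundant=0,
    random_state=0,
)

results = []
for C in [0.01, 0.1, 1.0, 10.0]:
    # liblinear supports both the L1 and the L2 penalty
    l1_model = LogisticRegression(penalty='l1', C=C, solver='liblinear').fit(X_reg, y_reg)
    l2_model = LogisticRegression(penalty='l2', C=C, solver='liblinear').fit(X_reg, y_reg)

    n_zero_l1 = int(np.sum(l1_model.coef_ == 0))      # coefficients driven exactly to zero
    l2_norm = float(np.linalg.norm(l2_model.coef_))   # overall size of the L2 coefficients
    results.append((C, n_zero_l1, l2_norm))
    print(f'C={C:<5} L1 zeroed coefficients: {n_zero_l1:2d}   L2 coefficient norm: {l2_norm:.3f}')
```

As C shrinks, the L1 model should zero out more coefficients and the L2 model's coefficients should shrink toward zero without becoming exactly zero.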

## Q4. What is the ROC curve, and how is it used to evaluate the performance of the logistic regression model?

The **ROC curve** (Receiver Operating Characteristic) plots the true positive rate (sensitivity) against the false positive rate as the classification threshold varies. It shows how well the model separates the two classes independently of any single threshold. The area under the ROC curve (AUC) summarizes this in one number: 0.5 corresponds to random guessing, 1.0 to perfect separation, and a higher AUC indicates better discrimination.

In [None]:
# Example: ROC curve
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

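# Note: these probabilities are computed on the training data, so the AUC is optimistic.
# Use a held-out test set for a realistic estimate.
probs = logreg.predict_proba(X)[:, 1]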
fpr, tpr, thresholds = roc_curve(y, probs)
auc = roc_auc_score(y, probs)
plt.plot(fpr, tpr, label=f'AUC = {auc:.2f}')
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
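
Each point on the ROC curve corresponds to one threshold. The sketch below computes the true and false positive rates by hand at three thresholds on a held-out test set. The dataset, the split, and the thresholds are assumptions chosen for illustration, and `tpr_fpr` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data and a stratified train/test split (assumed setup)
X_roc, y_roc = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_roc, y_roc, test_size=0.3, random_state=0, stratify=y_roc
)

model = LogisticRegression().fit(X_train, y_train)
test_probs = model.predict_proba(X_test)[:, 1]

def tpr_fpr(y_true, probs, threshold):
    """Hypothetical helper: rates obtained when predicting 1 for probs >= threshold."""
    pred = probs >= threshold
    positives = y_true == 1
    negatives = y_true == 0
    tpr = np.sum(pred & positives) / np.sum(positives)
    fpr = np.sum(pred & negatives) / np.sum(negatives)
    return tpr, fpr

thresholds_to_try = (0.8, 0.5, 0.2)
rates = [tpr_fpr(y_test, test_probs, t) for t in thresholds_to_try]
for t, (tpr, fpr) in zip(thresholds_to_try, rates):
    print(f'threshold={t:.1f}  TPR={tpr:.2f}  FPR={fpr:.2f}')

test_auc = roc_auc_score(y_test, test_probs)
print(f'Held-out AUC: {test_auc:.2f}')
```

Lowering the threshold can only add positive predictions, so both rates move up (or stay the same) together. The ROC curve traces this trade-off across all thresholds.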

## Q5. What are some common techniques for feature selection in logistic regression? How do these techniques help improve the model's performance?

- **L1 regularization (Lasso):** Automatically selects features by shrinking some coefficients to zero.
- **Recursive Feature Elimination (RFE):** Iteratively removes least important features.
- **Univariate statistical tests:** Select features based on statistical significance.

These techniques reduce overfitting, improve interpretability, and may enhance model accuracy.

In [None]:
# Example: Feature selection with RFE
from sklearn.feature_selection import RFE

selector = RFE(logreg, n_features_to_select=1)
selector.fit(X, y)
print('Feature ranking:', selector.ranking_)
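
The other two techniques listed above can be sketched in the same way. This example assumes a synthetic dataset with 8 features (3 informative). The choices of k=3, the ANOVA F-test, and C=0.1 are illustrative, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumed setup): 8 features, only 3 of them informative
X_fs, y_fs = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# Univariate test: keep the 3 features with the highest ANOVA F-scores
kbest = SelectKBest(score_func=f_classif, k=3).fit(X_fs, y_fs)
kbest_columns = kbest.get_support(indices=True)
print('SelectKBest keeps columns:  ', kbest_columns)

# L1-based selection: keep features whose L1-penalized coefficient is
# effectively non-zero (SelectFromModel uses a tiny threshold for L1 models)
l1_model = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)
l1_selector = SelectFromModel(l1_model).fit(X_fs, y_fs)
l1_columns = l1_selector.get_support(indices=True)
print('L1 selection keeps columns: ', l1_columns)

# Either selector can reduce the feature matrix for downstream modeling
X_reduced = kbest.transform(X_fs)
print('Reduced shape:', X_reduced.shape)
```

The two methods need not agree: the univariate test scores each feature in isolation, while the L1 model judges features jointly.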

## Q6. How can you handle imbalanced datasets in logistic regression? What are some strategies for dealing with class imbalance?

- **Resampling:** Oversample minority class or undersample majority class.
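- **Class weights:** Use class_weight='balanced' in scikit-learn, which weights each class inversely to its frequency so that errors on the minority class cost more.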
- **Synthetic data:** Use SMOTE or similar techniques to generate synthetic samples.
- **Evaluation metrics:** Use metrics like precision, recall, F1-score, and AUC instead of accuracy.

In [None]:
# Example: Using class weights
# Note: the X, y generated earlier are roughly balanced, so the weights change little here.
logreg_balanced = LogisticRegression(class_weight='balanced')
logreg_balanced.fit(X, y)
print('Coefficients (balanced):', logreg_balanced.coef_)
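
The effect of class weights is easier to see on data that is actually imbalanced. The sketch below assumes a synthetic dataset with roughly 5% positives and compares accuracy and minority-class recall on a held-out split. The exact numbers depend on the data and are not meant as a benchmark.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumed setup): about 95% class 0, 5% class 1
X_imb, y_imb = make_classification(
    n_samples=2000, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.95, 0.05], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_imb, y_imb, test_size=0.3, random_state=0, stratify=y_imb
)

models = {
    'unweighted': LogisticRegression(),
    'balanced': LogisticRegression(class_weight='balanced'),
}

summary = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    summary[name] = {
        'accuracy': accuracy_score(y_te, pred),
        'minority_recall': recall_score(y_te, pred, pos_label=1),
    }
    print(f"{name:>10}: accuracy={summary[name]['accuracy']:.3f}  "
          f"minority recall={summary[name]['minority_recall']:.3f}")
```

An unweighted model can reach high accuracy simply by favoring the majority class, which is why recall on the minority class is reported alongside it. Balanced weights typically trade some accuracy for higher minority recall.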

## Q7. Can you discuss some common issues and challenges that may arise when implementing logistic regression, and how they can be addressed? For example, what can be done if there is multicollinearity among the independent variables?

**Common issues:**
- **Multicollinearity:** Use regularization (L1/L2), remove or combine correlated features, or use dimensionality reduction (PCA).
- **Non-linearity:** Add interaction or polynomial terms, or use non-linear models.
- **Outliers:** Remove or transform outliers.
- **Imbalanced data:** Use resampling or class weights.
- **Missing values:** Impute missing data before modeling.
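
Several of these fixes can be combined in a single scikit-learn pipeline. The sketch below is one possible arrangement under stated assumptions: the data is synthetic, a near-duplicate column is added to create multicollinearity, about 5% of entries are blanked out to simulate missing values, and median imputation with the default L2 penalty are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_raw, y_raw = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)

# Multicollinearity: append a column that nearly duplicates column 0
near_copy = X_raw[:, [0]] + 0.01 * rng.normal(size=(300, 1))
correlation = np.corrcoef(X_raw[:, 0], near_copy[:, 0])[0, 1]
print(f'Correlation between column 0 and its near-copy: {correlation:.4f}')

# Missing values: blank out roughly 5% of the entries
X_messy = np.hstack([X_raw, near_copy])
X_messy[rng.random(X_messy.shape) < 0.05] = np.nan

# Impute, scale, then fit an L2-regularized model (the scikit-learn default)
pipeline = make_pipeline(
    SimpleImputer(strategy='median'),
    StandardScaler(),
    LogisticRegression(C=1.0),
)
pipeline.fit(X_messy, y_raw)

coefficients = pipeline.named_steps['logisticregression'].coef_
print('Training accuracy:', round(pipeline.score(X_messy, y_raw), 3))
print('Coefficients:', np.round(coefficients, 3))
```

With two nearly identical columns, an unpenalized fit has no unique way to split the weight between them. The L2 penalty keeps the coefficients finite and tends to share the weight across the correlated pair.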