# Theoretical Questions - **Logistic Regression**

### **1. What is Logistic Regression, and how does it differ from Linear Regression?**


**Logistic Regression** is a statistical technique for classification (primarily *binary* classification). It predicts the probability that an observation belongs to a particular class (e.g., 0 or 1). *Unlike Linear Regression, which predicts continuous outcomes* (e.g., house prices), Logistic Regression outputs probabilities bounded between 0 and 1, making it suitable for categorical outcomes.

**Key Differences:**
- **Type of Output:** Linear Regression predicts a *continuous value* (e.g., $y = -2.5$ or $y = 100$), while Logistic Regression predicts a probability (e.g., $P(y=1) = 0.75$) using the *sigmoid function*, and then assigns the class by comparing that probability with a threshold (0.5 by default).
- **Model Equation:** Linear Regression uses $y = \theta_0 + \theta_1x$, while Logistic Regression models the log-odds:
  > $\log\left(\frac{p}{1-p}\right) = \theta_0 + \theta_1x$ <br> which, solved for $p$, gives <br> $p = \frac{1}{1 + e^{-(\theta_0 + \theta_1 x)}}$
- **Purpose:** Linear Regression is for regression tasks; Logistic Regression is for classification.
- **Example:** Predicting temperature or prices (Linear) vs. predicting whether it will rain (yes/no) or whether a stock price will go up or down (Logistic).
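
The relationship between the log-odds equation and the probability equation can be checked numerically. The sketch below uses made-up values (`theta0`, `theta1`, and `x` are assumptions for illustration, not fitted coefficients) and shows that the sigmoid turns a log-odds value into a probability, while the logit turns it back.

```python
import math

def sigmoid(z):
    """Map a log-odds value z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Map a probability p in (0, 1) back to its log-odds."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients for a single feature (illustration only).
theta0, theta1 = -1.0, 0.5
x = 4.0

log_odds = theta0 + theta1 * x    # linear part, unbounded
p = sigmoid(log_odds)             # probability of class 1
predicted_class = int(p >= 0.5)   # default decision threshold

print(f"log-odds={log_odds:.2f}, p={p:.3f}, class={predicted_class}")
print(f"logit(p) recovers the log-odds: {logit(p):.2f}")
```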



### **2. What is the mathematical equation of Logistic Regression?**

Logistic Regression models the probability of a binary outcome using the logistic (sigmoid) function. The equation relates the linear combination of features to a probability:

> <br> $P(y=1|x) = \frac{1}{1 + e^{-(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n)}} = \hat{y}$ <br>

Where:
- $P(y=1|x)$: Probability of the positive class given features $x$.
- $\theta_0$: Intercept.
- $\theta_1, \theta_2, \dots, \theta_n$: Coefficients for features $x_1, x_2, \dots, x_n$.
- $e$: Base of the natural logarithm.


**Intuition:** The linear part $\theta_0 + \theta_1 x_1 + \dots + \theta_n x_n$ (the same expression used in linear regression) can take any value in $(-\infty, \infty)$, and the sigmoid squashes it into the range $(0, 1)$.
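
As a rough illustration of the equation, the sketch below computes $P(y=1|x)$ for several inputs. The coefficient values in `theta` are invented for the example, and the sigmoid is written in a numerically stable form so that very large or very small scores do not overflow.

```python
import math

def sigmoid(z):
    """Numerically stable sigmoid: avoids overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(theta, x):
    """P(y=1|x) for theta = [theta_0, theta_1, ..., theta_n] and features x."""
    z = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))
    return sigmoid(z)

# Hypothetical coefficients: intercept plus two feature weights.
theta = [0.5, 1.2, -0.8]

for x in ([0.0, 0.0], [3.0, 1.0], [-3.0, 4.0]):
    print(x, round(predict_proba(theta, x), 4))
```

Whatever the size of the linear score, the output stays between 0 and 1.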

### **3. Why do we use the Sigmoid function in Logistic Regression?**

The Sigmoid function, $\sigma(z) = \frac{1}{1 + e^{-z}}$, is used because it transforms a linear combination of inputs (continuous and unbounded) into a probability between 0 and 1. This matches the goal of binary classification: estimating $P(y=1)$. Here $z = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n$ is the linear-regression expression, and applying the sigmoid to it is the main difference from Linear Regression: it squashes the unbounded line into the interval $(0, 1)$.

### **4. What is the cost function of Logistic Regression?**

The cost function for Logistic Regression is the **Log Loss** also called **Cross-Entropy Loss**, which measures the difference between predicted probabilities and actual labels. 

> For a binary classification:
$J(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$

Where:
- $(n)$: Number of samples.
- $(y_i)$: True label (0 or 1).
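- $\hat{y}_i = \sigma(z_i)$: Predicted probability for sample $i$, where $\sigma$ is the sigmoid function and $z_i$ is the linear combination of that sample's features (e.g., $\theta_0 + \theta_1 x_1$ for one feature).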
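
The formula can be written out directly. This is a minimal sketch with invented labels and probabilities; the `eps` clipping is an added safeguard (an assumption, not part of the formula) to avoid taking $\log(0)$.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy between labels (0/1) and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)   # keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

y_true = [1, 0, 1, 0]
mostly_right = [0.9, 0.1, 0.8, 0.2]   # confident and correct
mostly_wrong = [0.1, 0.9, 0.2, 0.8]   # confident and wrong

print("good predictions:", round(log_loss(y_true, mostly_right), 4))
print("bad predictions: ", round(log_loss(y_true, mostly_wrong), 4))
```

Confident wrong predictions are penalized much more heavily than confident correct ones are rewarded, which is what pushes the model toward calibrated probabilities.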



### **5. What is Regularization in Logistic Regression? Why is it needed?**

Regularization adds a penalty term to the cost function that discourages overly complex models (large coefficients). It is used to reduce overfitting (Ridge, or L2) or to perform feature selection (Lasso, or L1).

**Modified Cost Function:**
$J(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right] + \lambda_1 \sum_{j} \theta_{j}^{2} + \lambda_2 \sum_{j} |\theta_j|$

The $\lambda$'s control the strength of each penalty. This is the Elastic Net form: the squared term is the Ridge (L2) penalty and the absolute-value term is the Lasso (L1) penalty. Setting $\lambda_2 = 0$ gives pure Ridge, and setting $\lambda_1 = 0$ gives pure Lasso.

**Why Needed:**
- **Overfitting:** L2 (Ridge) regularization helps reduce overfitting by keeping the coefficients small.
- **Feature Selection:** L1 (Lasso) shrinks some coefficients to exactly zero, typically those of features that contribute little to the prediction.
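
The penalty part of the modified cost function can be computed on its own. The sketch below assumes, as is common practice but not stated above, that the intercept $\theta_0$ is left unpenalized; the coefficient values and the names `lam_ridge` and `lam_lasso` are illustrative.

```python
def elastic_net_penalty(theta, lam_ridge, lam_lasso):
    """lam_ridge * sum(theta_j^2) + lam_lasso * sum(|theta_j|), skipping theta_0."""
    coefs = theta[1:]                       # assumption: intercept is not penalized
    ridge_term = sum(t * t for t in coefs)  # L2 part
    lasso_term = sum(abs(t) for t in coefs) # L1 part
    return lam_ridge * ridge_term + lam_lasso * lasso_term

theta = [0.3, 2.0, -1.0, 0.0]   # hypothetical [theta_0, theta_1, theta_2, theta_3]

print("ridge only :", elastic_net_penalty(theta, lam_ridge=0.1, lam_lasso=0.0))
print("lasso only :", elastic_net_penalty(theta, lam_ridge=0.0, lam_lasso=0.5))
print("elastic net:", elastic_net_penalty(theta, lam_ridge=0.1, lam_lasso=0.5))
```

This value is added to the log loss, so larger coefficients make the total cost higher even when the fit to the data is unchanged.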

### **6. Explain the difference between Lasso, Ridge, and Elastic Net regression?**


- **Lasso (L1):**
  - Penalty: $\sum |\theta_j|$
  - Shrinks the coefficients of features that contribute little to the target to exactly 0, which helps in feature selection.


- **Ridge (L2):**
  - Penalty: $\sum \theta_j^2$
  - Shrinks coefficients toward 0 but not exactly 0, helps in reducing overfitting.

- **Elastic Net:**
  - Penalty: $\alpha \sum |\theta_j| + (1 - \alpha) \sum \theta_j^2$, where $\alpha \in [0, 1]$ sets the mix of L1 and L2.
  - Combines feature selection (L1) and coefficient shrinkage (L2).
  - It can be used when features are correlated or when we are not sure which regularization to use, since it balances L1 and L2.
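
One way to see why L1 produces exact zeros while L2 does not is a simplified one-dimensional problem. The sketch assumes a quadratic loss $\frac{1}{2}(\theta - a)^2$ around an unpenalized estimate $a$ (an assumption made for illustration; the logistic loss has no such closed form). Under that assumption, the L1-penalized minimizer is the soft-thresholding rule and the L2-penalized minimizer is a plain rescaling.

```python
def l1_shrink(a, lam):
    """Minimizer of 0.5*(t - a)**2 + lam*|t| (soft-thresholding)."""
    magnitude = max(abs(a) - lam, 0.0)
    return magnitude if a >= 0 else -magnitude

def l2_shrink(a, lam):
    """Minimizer of 0.5*(t - a)**2 + lam*t**2 (proportional shrinkage)."""
    return a / (1.0 + 2.0 * lam)

unpenalized = [2.0, -0.3, 0.05, -1.5]   # hypothetical unpenalized estimates
lam = 0.5

lasso = [l1_shrink(a, lam) for a in unpenalized]
ridge = [l2_shrink(a, lam) for a in unpenalized]

print("lasso:", lasso)   # small estimates are set exactly to zero
print("ridge:", ridge)   # every estimate shrinks, none becomes zero
```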

### **7. When should we use Elastic Net instead of Lasso or Ridge?**

We can use **Elastic Net** when:
- **Correlated Features:** Lasso might arbitrarily select one feature from a correlated group, while Elastic Net handles multicollinearity better by combining L1 and L2.
- **Many Features:** With a large number of features, Elastic Net can both shrink coefficients and set some of them to exactly 0.

In general, Elastic Net might not add much for small datasets.

### **8. What is the impact of the regularization parameter (λ) in Logistic Regression?**

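The regularization parameter $\lambda$ (or its inverse, $C = 1/\lambda$) controls the strength of the penalty:

- **Large $\lambda$ (small $C$)** means strong regularization:
  - Coefficients shrink more (many become 0 with L1).
  - Simpler model, less overfitting, but a risk of underfitting.
- **Small $\lambda$ (large $C$)** means weak regularization: coefficients are freer to grow, so the model fits the training data more closely and may overfit.
- A common starting point is $\lambda = 1$ (scikit-learn's default is `C=1.0`), and the value is then chosen by hyperparameter tuning.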
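
The shrinking effect of $\lambda$ can be seen on a tiny example. This is a sketch, not a production optimizer: it fits a one-feature model by plain gradient descent on the mean log loss plus an L2 penalty on the slope only. The dataset, learning rate, and step count are all assumptions chosen so that the loop converges.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_slope(xs, ys, lam, lr=0.1, steps=2000):
    """Gradient descent on mean log loss + lam * t1**2; returns the slope t1."""
    t0 = t1 = 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [sigmoid(t0 + t1 * x) - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / n
        grad1 = sum(e * x for e, x in zip(errors, xs)) / n + 2.0 * lam * t1
        t0 -= lr * grad0
        t1 -= lr * grad1
    return t1

# Toy data with overlapping classes, so the unpenalized slope stays finite.
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

lambdas = (0.0, 0.1, 1.0)
slopes = [fit_slope(xs, ys, lam) for lam in lambdas]
for lam, slope in zip(lambdas, slopes):
    print(f"lambda={lam}: slope={slope:.3f}")
```

The fitted slope gets smaller as $\lambda$ grows, which is the "simpler model" effect described above.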

### **9. What are the key assumptions of Logistic Regression?**


1. **Binary or Categorical Outcome:** The target variable should be categorical. Logistic regression is primarily used for binary classification, so the target is usually binary (e.g., 0 or 1), such as pass/fail or yes/no.
2. **Independence:** Observations should be independent of each other. For example, one student’s test score shouldn’t affect another’s.
3. **Linearity in Log-Odds:** The relationship between features and the log-odds (also known as logit) $\log\left(\frac{p}{1-p}\right)$ should be linear. This doesn’t mean the probability itself is linear, but the transformed odds are.
4. **No Multicollinearity:** Features shouldn’t be highly correlated among themselves. If study hours and prep time are nearly identical, the model cannot tell which one matters more. (This is where regularization helps.)
5. **Large Sample Size:** We need a decent amount of data, especially if one class is rare. Imbalanced data that is not handled properly can lead to misleading inferences.

- Unlike Linear Regression, we don’t need the assumption of normally distributed residuals.

### **10. What are some alternatives to Logistic Regression for classification tasks?**

- The closest alternative to logistic regression is **Probit regression**, which uses the inverse of the standard normal CDF as the link function instead of the logit.

Some other classification algorithms:

- **Decision Trees:** Split data into branches based on feature thresholds. They are easy to visualize.
- **Random Forests:** A team of decision trees voting together. They’re robust and handle messy data well, though less interpretable.
- **Support Vector Machines (SVM):** Find the best boundary (hyperplane) to separate classes. With kernels, they can tackle non-linear problems.
- **K-Nearest Neighbors (KNN):** Looks at the closest $k$ data points to decide the target category.
- **Naive Bayes:** Uses probability and assumes features are independent. It’s fast for text tasks like spam filtering.
- **Neural Networks:** Probably the most flexible of these. They can capture complex patterns (e.g., image recognition), but they need lots of data and tuning.


### **11. What are Classification Evaluation Metrics?**

- **Confusion Matrix:** A table showing the counts for True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).
- **Accuracy:** The fraction of correct predictions out of all data points: $\frac{TP + TN}{TP + TN + FP + FN}$. It is easy to interpret, but we should not rely on accuracy alone; the other metrics need checking too.
- **Precision:** Of all predicted positives, how many were actually positive: $\frac{TP}{TP + FP}$. It is a key metric when false positives are costly (e.g., spam filtering).
- **Recall (Sensitivity):** Of all actual positives, how many were correctly predicted: $\frac{TP}{TP + FN}$. It is critical when false negatives must be avoided (e.g., disease detection).
- **F-beta Score:** Balances precision and recall: $F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$. Useful for imbalanced data. Common choices are $\beta = 1$ when FP and FN are equally critical, $\beta = 0.5$ when FP is more critical, and $\beta = 2$ when FN is more critical.
- **ROC-AUC:** Plots the true positive rate against the false positive rate (TPR vs. FPR) across thresholds. A higher area under the curve indicates better separation of the classes.
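
The count-based metrics above can be computed directly from the four cells of the confusion matrix. The counts used here are invented for illustration, and the function does not guard against zero denominators.

```python
def classification_metrics(tp, tn, fp, fn, beta=1.0):
    """Accuracy, precision, recall and F-beta from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_beta": f_beta}

# Hypothetical confusion-matrix counts.
counts = dict(tp=40, tn=45, fp=5, fn=10)

for beta in (0.5, 1.0, 2.0):
    metrics = classification_metrics(**counts, beta=beta)
    print(f"beta={beta}:", {k: round(v, 4) for k, v in metrics.items()})
```

Because precision is higher than recall for these counts, the score with $\beta = 0.5$ (which favors precision) comes out higher than the score with $\beta = 2$ (which favors recall).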



### **12. How does class imbalance affect Logistic Regression?**

Class imbalance means one class dominates the dataset (e.g., 90% no, 10% yes), which can bias a Logistic Regression model. Intuitively, even a model that outputs 'no' for every sample would show roughly 90% accuracy, which is misleading. The effects are:

- **Bias Toward Majority:** The model might just predict the majority class (e.g., "no") to look good on accuracy, ignoring the minority.
- **Skewed Probabilities:** Coefficients get tuned to the bigger class, making the model less sensitive to the rare one.
- **Cost Function Issue:** Our standard cost function <center> <br> $J(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$ </center> treats all errors equally, so minimizing it mostly optimizes for the majority class.

**Fixes:**
- We can add class weights so that errors on the minority class are penalized more.
- We can resample during preprocessing: oversample the minority class or undersample the majority class.
- We can change the decision threshold (e.g., from 0.5 to 0.3).
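
The threshold fix can be illustrated with a handful of predicted probabilities. The labels and probabilities below are invented (3 positives out of 10 samples); the sketch only shows how moving the threshold from 0.5 to 0.3 trades precision for recall.

```python
def precision_recall(y_true, y_prob, threshold):
    """Precision and recall of the positive class at a given decision threshold."""
    y_pred = [int(p >= threshold) for p in y_prob]
    tp = sum(1 for y, yp in zip(y_true, y_pred) if y == 1 and yp == 1)
    fp = sum(1 for y, yp in zip(y_true, y_pred) if y == 0 and yp == 1)
    fn = sum(1 for y, yp in zip(y_true, y_pred) if y == 1 and yp == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical imbalanced sample: 3 positives, 7 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_prob = [0.8, 0.45, 0.35, 0.4, 0.2, 0.1, 0.3, 0.05, 0.15, 0.25]

for threshold in (0.5, 0.3):
    precision, recall = precision_recall(y_true, y_prob, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

Lowering the threshold catches more of the rare class at the price of more false positives, so the right value depends on which error is more costly.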



### **13. What is Hyperparameter Tuning in Logistic Regression?**

Hyperparameter tuning is the process of finding the combination of hyperparameters that gives the best model performance, like tweaking a machine's settings. Hyperparameters are values set before training. In Logistic Regression, the main ones to tune are:

- **C (Inverse of Regularization Strength):** Controls how much we penalize big coefficients (small C = strong penalty).
- **Penalty:** Which regularization works best for our dataset: L1 (Lasso), L2 (Ridge), or Elastic Net.
- **Solver:** The algorithm that solves the optimization (e.g., `liblinear`, `saga`, `lbfgs` (default)).

**Methods of Tuning:** Both methods use cross-validation, where the training data is split further into training and validation folds.
- **Grid Search CV:** Tries every combination of the given parameter values (e.g., C = [0.1, 1, 10]).
- **Random Search CV:** Samples random combinations, which is faster for large search ranges.
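
A minimal sketch of grid search with cross-validation, assuming scikit-learn is installed. The synthetic dataset, the candidate values of `C`, the 5-fold split, and the F1 scoring choice are all illustrative assumptions rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (illustration only).
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Scaling lives inside the pipeline so each CV fold is scaled on its own training part.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# make_pipeline names the step after the lowercased class name.
param_grid = {"logisticregression__C": [0.01, 0.1, 1, 10]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

test_score = search.score(X_test, y_test)   # F1 of the best model on held-out data
print("best params:", search.best_params_)
print("best CV F1 :", round(search.best_score_, 3))
print("test F1    :", round(test_score, 3))
```

`RandomizedSearchCV` from the same module follows the same pattern but samples a fixed number of combinations instead of trying all of them.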



### **14. What are different solvers in Logistic Regression? Which one should be used?**

Solvers are the optimization algorithms that minimize our cost function. Here’s the rundown:

- **liblinear:** Uses coordinate descent. Good for small datasets and supports L1 or L2 penalties. For multiclass problems it only supports One-vs-Rest, not the multinomial (Softmax) formulation.
- **lbfgs:** Default solver in scikit-learn. A quasi-Newton method that approximates the second-order information used by Newton’s method. Efficient for medium datasets with L2.
- **newton-cg:** Newton’s method with conjugate gradient. Accurate but computationally heavy; supports L2.
- **sag:** Stochastic Average Gradient. Fast for big datasets with L2.
- **saga:** Like SAG but supports L1, L2, and Elastic Net, which makes it the most versatile.


**Logistic Regression Solvers Comparison**

| Solver      | Type of Optimization  | Supports L1 | Supports L2 | Multiclass Support | Best Use Case |
|------------|----------------------|-------------|-------------|------------------|---------------|
| **lbfgs**  | Quasi-Newton (2nd order) | ❌ No | ✅ Yes | ✅ Yes (Softmax) | Best for **multiclass classification** |
| **liblinear** | Coordinate Descent (1st order) | ✅ Yes | ✅ Yes | ❌ No (OvR only) | Best for **small datasets, L1 regularization** |
| **saga**   | Stochastic Average Gradient (1st order) | ✅ Yes | ✅ Yes | ✅ Yes (Softmax) | Best for **large datasets & L1 regularization** |
| **sag**    | Stochastic Average Gradient (1st order) | ❌ No | ✅ Yes | ✅ Yes (Softmax) | Best for **large datasets & fast convergence** |
| **newton-cg** | Newton’s Method (2nd order) | ❌ No | ✅ Yes | ✅ Yes (Softmax) | Best for **L2-regularized problems** |

**Which one to use**
- **`lbfgs` (also the default) for most problems (multiclass & L2 regularization).**
- **`saga` for L1 or Elastic Net regularization.**
- **`liblinear` for small datasets with L1 regularization.**
- **`sag` for massive datasets with L2.**
- **`newton-cg` when second-order optimization is needed with L2 regularization.**
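
To get a feel for the solvers, the sketch below fits the same L2-regularized model with each one, assuming scikit-learn is installed. The two well-separated blobs are synthetic, so every solver is expected to reach a similar fit; on real data the differences show up mainly in speed and in which penalties are supported.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Two well-separated synthetic clusters (illustration only).
X, y = make_blobs(n_samples=200, centers=[[-3, -3], [3, 3]],
                  cluster_std=1.0, random_state=0)

# sag and saga converge much faster on standardized features.
X = StandardScaler().fit_transform(X)

accuracies = {}
for solver in ["lbfgs", "liblinear", "newton-cg", "sag", "saga"]:
    model = LogisticRegression(solver=solver, max_iter=2000, random_state=0)
    model.fit(X, y)
    accuracies[solver] = model.score(X, y)   # training accuracy

for solver, acc in accuracies.items():
    print(f"{solver:10s} accuracy={acc:.3f}")
```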


### **15. How is Logistic Regression extended for multiclass classification?**


Logistic Regression is primarily for **binary classification**, but it can be extended to handle **multiclass classification** using two main methods:

**1. One-vs-Rest (OvR)**
- Trains **K binary classifiers** (one per class).
- For each classifier, one class is **positive (1)**, and all others are **negative (0)**.
- The class with the **highest probability** is chosen.
- Works with **all solvers** (e.g., liblinear, lbfgs, saga).
- It's simple but can be inefficient when there are many classes, since it trains a separate classifier for each class.

**2. Softmax Regression (Multinomial Logistic Regression)**
- It is based on the Multinomial distribution, which generalizes the Binomial distribution to more than two outcomes.
- Directly models **all K classes** using the **Softmax function**.
- Uses a **single model** instead of K binary classifiers.
- The multinomial option only works with **solvers supporting multinomial logistic regression** (e.g., lbfgs, saga, newton-cg).

**Mathematical Representation:**
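For class $k$ out of $K$ classes, with one coefficient vector $\theta_k$ per class, the probability is:

$P(y=k|x) = \frac{e^{\theta_k^{T} x}}{\sum_{j=1}^{K} e^{\theta_j^{T} x}}$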
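
The Softmax function turns one linear score per class into probabilities that sum to 1. A minimal sketch follows; the three scores are invented, and subtracting the maximum score is a standard numerical-stability step that does not change the result.

```python
import math

def softmax(scores):
    """Convert per-class scores into probabilities that sum to 1."""
    m = max(scores)                              # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]   # hypothetical linear scores for 3 classes
probs = softmax(scores)
predicted = probs.index(max(probs))

print("probabilities:", [round(p, 3) for p in probs])
print("predicted class:", predicted)
```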



**Key difference**
| Scenario | Method to use |
|----------|------------------|
| Small datasets or `liblinear` solver | **One-vs-Rest (OvR)** |
| Large datasets, Softmax required | **Multinomial (Softmax Regression)** |



### **16. What are the advantages and disadvantages of Logistic Regression?**


Logistic Regression is one of the most popular classification algorithms because it is simple yet effective for many real-world problems. However, it has some limitations as well.


**Advantages of Logistic Regression**
1. **Simple and Interpretable**  
   - Easy to implement and interpret compared to other complex models.
   - The coefficients indicate the direction and strength of each feature's effect (comparable across features when they are on the same scale).

2. **Probabilistic Output**  
   - Outputs class probabilities instead of just predictions (in scikit-learn, via the model's `predict_proba` method).
   - Enables threshold-based decision-making.

3. **Efficient and Fast**  
   - Works well on small to medium-sized datasets.
   - Training and inference are computationally efficient.

4. **Works Well with Linearly Separable Data**  
   - If data is linearly separable, Logistic Regression performs well.

5. **Handles Multiclass Classification**  
   - Supports **One-vs-Rest (OvR)** and **Softmax Regression (multinomial)** for multiclass problems.

6. **Regularization Support (L1 & L2)**  
   - L1 (Lasso) for feature selection.
   - L2 (Ridge) for reducing overfitting.


**Disadvantages of Logistic Regression**
1. **Assumes Linear Decision Boundary**  
   - Cannot model complex relationships in non-linear data.
   - Needs transformation (polynomial features) or kernel tricks.

2. **Sensitive to Outliers**  
   - Outliers can heavily affect parameter estimation.

3. **Requires Feature Engineering**  
   - Needs careful feature selection and scaling for optimal performance.
   - Performance depends on choosing the right independent variables.

4. **Not Suitable for Large Feature Spaces**  
   - With very high-dimensional data, training can become slow and the model tends to overfit unless it is regularized.

5. **Cannot Handle Highly Correlated Features (Multicollinearity)**  
   - Multicollinearity can distort coefficient estimates.
   - We use **feature selection or PCA** to address the issue.


**When to Use Logistic Regression?**
| Scenario | Logistic Regression |
|----------|----------------------|
| Binary Classification (0/1) | Yes |
| Multiclass Classification (OvR/Softmax) | Yes |
| Large, high-dimensional data | No (Consider SVM, Neural Networks) |
| Linearly Separable Data | Yes |
| Complex Non-Linear Relationships | No (Consider Decision Trees, Neural Networks) |
| Need Probabilistic Outputs | Yes |



    # Visualization: linearly separable vs. non-linearly separable data
    import matplotlib.pyplot as plt
    from sklearn.datasets import make_classification, make_moons

    # Generate two well-separated 2D clusters (roughly linearly separable)
    X_linear, y_linear = make_classification(
        n_samples=100, n_features=2, n_informative=2, n_redundant=0,
        n_clusters_per_class=1, n_classes=2, class_sep=2.0, flip_y=0,
        random_state=42,
    )

    # Generate non-linearly separable data (two interleaving half-moons)
    X_nonlinear, y_nonlinear = make_moons(n_samples=100, noise=0.2, random_state=42)

    # Plot both datasets side by side
    fig, ax = plt.subplots(1, 2, figsize=(12, 5))

    # Linearly separable
    ax[0].scatter(X_linear[:, 0], X_linear[:, 1], c=y_linear, cmap='bwr', edgecolors='k')
    ax[0].set_title("Linearly Separable Data")

    # Not linearly separable
    ax[1].scatter(X_nonlinear[:, 0], X_nonlinear[:, 1], c=y_nonlinear, cmap='bwr', edgecolors='k')
    ax[1].set_title("Non-Linearly Separable Data")

    plt.show()


### **17. What are some use cases of Logistic Regression?**


Logistic Regression is widely used for **classification tasks** especially **binary** where the goal is to predict discrete categorical outcomes. Some common applications include:

**1. Medical Diagnosis**
- Predicting diseases (yes/no, e.g., diabetic/non-diabetic).
- Classifying patients as **high-risk or low-risk**.

**2. Fraud Detection and Spam Filtering**
- Identifying **fraudulent transactions** in banking (or fraud alerts such as those shown by Truecaller).
- Detecting **spam emails** in email filtering.

**3. Customer Churn Prediction**
- Predicting whether a **customer will leave a service** (e.g., telecom, SaaS platforms).

**4. Credit Scoring & Loan Approval**
- Banks use it to determine **loan approval** based on customer history.

**5. Marketing & Ad Click Prediction**
- Predicting whether a user will **click on an ad** (CTR prediction) or whether a customer will buy a product.
- Classifying customers for **targeted marketing campaigns**.

**6. Sentiment Analysis**
- Classifying text as **positive, neutral, or negative**.
- Common in **social media analysis** and **customer reviews**.



### **18. What is the difference between Softmax Regression and Logistic Regression?**

Logistic Regression and Softmax Regression are both classification algorithms. The key difference is the **number of classes** they handle: Logistic Regression is for two classes, while Softmax Regression uses the Multinomial formulation for multiclass problems.

| **Feature**        | **Logistic Regression** | **Softmax Regression** |
|--------------------|------------------------|------------------------|
| **Used For**       | Binary Classification (0 or 1) | Multiclass Classification (3+ classes) |
| **Decision Rule** | Threshold on the predicted probability (**0.5** by default) | Selects the class with the highest probability |
| **Model Type**     | One decision function for 2 classes | A separate function for each class |
| **Solver Support** | Works with **all solvers** | Needs `lbfgs`, `saga`, or `newton-cg` |

- **We use Logistic Regression** when there are **only two classes (binary classification)**.
- **We use Softmax Regression** when there are **multiple classes (multiclass classification)**.
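
The two models are closely related: with two classes, Softmax reduces to the sigmoid. The sketch below checks this numerically by fixing the score of the second class at 0, which is an assumed parameterization chosen for the comparison.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    m = max(scores)                              # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Compare P(class 1) from both formulations for a few scores z.
differences = []
for z in (-3.0, 0.0, 1.5):
    p_sigmoid = sigmoid(z)
    p_softmax = softmax([z, 0.0])[0]   # class 1 score z, class 0 score fixed at 0
    differences.append(abs(p_sigmoid - p_softmax))
    print(f"z={z}: sigmoid={p_sigmoid:.6f}, softmax={p_softmax:.6f}")
```

This is why binary Logistic Regression can be seen as the two-class special case of Softmax Regression.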




### **19. How do we choose between One-vs-Rest (OvR) and Softmax for multiclass classification?**


Both **One-vs-Rest (OvR)** and **Softmax (Multinomial Logistic Regression)** are used for **multiclass classification**, but their choice depends on dataset size, solver compatibility, and efficiency.

**One-vs-Rest (OvR)**
- Trains **K binary classifiers** (one for each *class vs. the rest*).
- The class with the **highest probability** wins.

**Advantages:**  
- Works with **all solvers** (including `liblinear`).
- Faster for **very large datasets** with **many classes**.
- Easier to interpret.

**Disadvantages:**  
- Requires training **K models**, which increases training time as K grows.
- May have **inconsistent probability estimates**.

**Softmax (Multinomial Logistic Regression)**
- Uses a **single model** to compute probabilities for all classes.
- Relies on the **Softmax function** to assign probabilities.

**Advantages:**  
- **More efficient** than OvR for smaller datasets.
- Provides **better probability calibration**.

**Disadvantages:**  
- Only works with specific solvers (`lbfgs`, `saga`, `newton-cg`).
- May not scale well for datasets with **many classes (K >> 10)**.

**Summary**
| Scenario | Recommended Approach |
|----------|----------------------|
| Small dataset | **Softmax (Multinomial)** |
| Large dataset, many classes (K > 10) | **OvR** |
| Need probability calibration | **Softmax (Multinomial)** |
| Using `liblinear` solver | **OvR** |
| Faster training for high-dimensional data | **OvR** |
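
The point about inconsistent probability estimates can be illustrated with plain arithmetic. The sketch makes a simplifying assumption: both approaches are given the same per-class scores (in practice OvR and Softmax are trained separately and produce different scores). OvR applies a sigmoid to each score independently, so the raw values need not sum to 1 and are renormalized afterwards, while Softmax normalizes jointly.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [1.2, 0.4, -0.7]   # hypothetical scores for 3 classes

ovr_raw = [sigmoid(s) for s in scores]                  # independent binary outputs
ovr_normalized = [p / sum(ovr_raw) for p in ovr_raw]    # rescaled to sum to 1
soft = softmax(scores)

print("OvR raw        :", [round(p, 3) for p in ovr_raw], "sum =", round(sum(ovr_raw), 3))
print("OvR normalized :", [round(p, 3) for p in ovr_normalized])
print("Softmax        :", [round(p, 3) for p in soft])
```

Both pick the same class here, but the probability values differ, which matters when the probabilities themselves are used downstream.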


### **20. How do we interpret coefficients in Logistic Regression?**


In Logistic Regression, the coefficients $\theta$ represent the **log-odds change** in the dependent variable(y) for a **one-unit increase** in the predictor variable(X).


**Understanding the Coefficients**
For features $x_1, \dots, x_n$, the probability of the positive class is:

$P(y=1|x) = \frac{1}{1 + e^{-(\theta_0 + \theta_1 x_1 + \dots + \theta_n x_n)}}$


- $\theta_j$, the coefficient for feature $x_j$, tells us how much the **log-odds** of the positive class change when $x_j$ increases by 1 unit, holding the other features fixed.

- The log-odds (logit) transformation:


$\log\left(\frac{P(y=1)}{P(y=0)}\right) = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n$

**Interpreting Coefficients in Terms of Odds Ratio**
- The exponentiated coefficient $e^{\theta_j}$ gives the **odds ratio**.
- If $e^{\theta_j} > 1$, the feature **increases** the likelihood of the positive class.
- If $e^{\theta_j} < 1$, the feature **decreases** the likelihood of the positive class.
- If $\theta_j = 0$ (so $e^{\theta_j} = 1$), the feature has **no effect**.

**Example:**  
If $\theta_1 = 0.7$, then $e^{0.7} \approx 2.01$, meaning a **one-unit increase in $x_1$ roughly doubles the odds** of $y=1$.
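
The example can be verified numerically. In the sketch below, `theta1 = 0.7` comes from the example above, while the intercept `theta0` and the values of `x` are assumptions added for illustration. The odds ratio for a one-unit increase is the same no matter where on the $x$ axis we start.

```python
import math

theta0, theta1 = -2.0, 0.7   # theta0 is hypothetical; theta1 is from the example

def odds(x):
    """Odds of y=1 at feature value x: p / (1 - p)."""
    p = 1.0 / (1.0 + math.exp(-(theta0 + theta1 * x)))
    return p / (1.0 - p)

odds_ratios = []
for x in (0.0, 1.0, 5.0):
    ratio = odds(x + 1.0) / odds(x)   # effect of a one-unit increase in x
    odds_ratios.append(ratio)
    print(f"x={x} -> x={x + 1.0}: odds ratio = {ratio:.4f}")

print("exp(theta1) =", round(math.exp(theta1), 4))
```

The probability itself does not change by a fixed amount per unit of $x$; only the odds are multiplied by the constant factor $e^{\theta_1}$.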



**Multiclass Logistic Regression**
- For **multiclass classification (Softmax)**, each class has a separate set of coefficients.
- Interpretation is done relative to a **reference class**.

