## Day 31 — Logistic Regression Case Study, Likelihood & Intro to SVM

This notebook is part of my **Machine Learning Learning Journey** and focuses on a
**Logistic Regression case study**, covering the **probabilistic foundation**,
**likelihood formulation**, **loss derivation**, and **model assumptions**.

The session also introduces the **intuition behind Support Vector Machines (SVM)**
as a transition to margin-based classifiers.

---


## 1. Recap: Logistic Regression Model

Linear model:
\[
z = w_1x_1 + w_2x_2 + \dots + w_nx_n + w_0
\]

Logistic Regression models the log-odds of the positive class as this linear function:
\[
\log\left(\frac{p}{1-p}\right) = w^Tx + w_0
\]

Where:
- \(p = P(y=1|x)\)
- The model outputs a probability, not an unbounded raw score


## 2. Why Logistic Regression?

- Linear Regression gives unbounded output
- Classification requires probabilities in [0, 1]
- Logistic Regression provides:
  - Interpretable coefficients (each weight is the change in log-odds per unit change in its feature)
  - Statistical interpretation
  - Probabilistic decision making

Logistic Regression is a **parametric classification algorithm**.


## 3. Log-Odds (Logit) Transformation

Odds:
\[
\text{odds} = \frac{p}{1-p}
\]

Log-odds:
\[
\log\left(\frac{p}{1-p}\right) = w^Tx + w_0
\]

Applying inverse transformation (Sigmoid):
\[
p = \frac{1}{1 + e^{-(w^Tx + w_0)}}
\]

This maps any real value to (0, 1).
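The logit and the sigmoid are inverses of each other, which a few lines of Python can confirm. This is a minimal sketch using only the standard library; the function names `sigmoid` and `logit` are chosen here for illustration.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued score z to a probability."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def logit(p: float) -> float:
    """Log-odds of a probability p strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

for z in (-3.0, 0.0, 3.0):
    p = sigmoid(z)
    # logit(sigmoid(z)) recovers z up to floating-point error
    print(f"z={z:+.1f}  p={p:.4f}  logit(p)={logit(p):+.4f}")
```

Note that `sigmoid(0.0)` is exactly 0.5, which is why a 0.5 probability threshold corresponds to the boundary \(w^Tx + w_0 = 0\).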


## 4. Classification Rule

Default threshold:
- \(p \ge 0.5 \rightarrow y = 1\)
- \(p < 0.5 \rightarrow y = 0\)

Note:
- Threshold is **not always 0.5**
- Threshold selection depends on:
  - ROC Curve
  - Cross-validation
  - Business cost
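As one way to see how business cost moves the threshold, the sketch below scans candidate thresholds and keeps the one with the lowest total misclassification cost. The labels, predicted probabilities, and the cost values are all hypothetical, and `total_cost` is a helper name introduced here.

```python
def total_cost(y_true, p_hat, threshold, cost_fp, cost_fn):
    """Total misclassification cost when predicting 1 for p >= threshold."""
    cost = 0.0
    for y, p in zip(y_true, p_hat):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 0:
            cost += cost_fp      # false positive
        elif pred == 0 and y == 1:
            cost += cost_fn      # false negative
    return cost

# Hypothetical validation labels and predicted probabilities
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
p_hat = [0.10, 0.35, 0.40, 0.45, 0.55, 0.70, 0.60, 0.90]

# Assumed costs: a missed positive is 5x as costly as a false alarm
COST_FP, COST_FN = 1.0, 5.0

candidates = [t / 100 for t in range(5, 100, 5)]
best = min(candidates,
           key=lambda t: total_cost(y_true, p_hat, t, COST_FP, COST_FN))

print("cost at 0.5:", total_cost(y_true, p_hat, 0.5, COST_FP, COST_FN))
print("best threshold:", best,
      "cost:", total_cost(y_true, p_hat, best, COST_FP, COST_FN))
```

With false negatives priced higher than false positives, the cheapest threshold on this toy data falls below 0.5. In practice the scan would be run on held-out or cross-validated predictions rather than on training data.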


## 5. Case Study: Student Marks Example

Feature:
- Hours studied

Target:
- Pass / Fail

Linear model:
\[
z = w_1 \cdot \text{hours} + w_0
\]

Logistic model:
\[
p = \frac{1}{1 + e^{-z}}
\]

With a positive weight \(w_1\), higher study hours → higher probability of passing
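To make the case study concrete, here is a small sketch that plugs a few study-hour values into the two formulas above. The coefficients `W1` and `W0` are hypothetical values picked for illustration, not fitted to any data.

```python
import math

# Hypothetical coefficients, chosen for illustration only (not fitted)
W1 = 1.5    # change in log-odds per extra hour studied
W0 = -6.0   # intercept

def pass_probability(hours: float) -> float:
    """P(pass | hours) under the assumed coefficients."""
    z = W1 * hours + W0
    return 1.0 / (1.0 + math.exp(-z))

for hours in (1, 2, 4, 6, 8):
    print(f"{hours} h -> P(pass) = {pass_probability(hours):.3f}")
```

Under these assumed values the probability crosses 0.5 where \(z = 0\), that is at \(\text{hours} = -w_0 / w_1 = 4\).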


## 6. Why Squared Error is Not Used

Squared Error:
\[
(y - \hat{y})^2
\]

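Problems:
- Poor probabilistic interpretation
- Non-convex in the weights when combined with the sigmoid, so gradient descent may stall in flat regions or poor local minima
- Under-penalizes confident wrong predictions: the squared error is at most 1, whereas the log loss grows without bound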

Hence, **Likelihood-based optimization** is used.
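The difference in penalties is easy to see numerically. The sketch below compares the squared error with the per-example negative log-likelihood (derived in Sections 7 to 10) for a true label \(y = 1\) as the predicted probability gets worse; the function names are illustrative.

```python
import math

def squared_error(y: int, p: float) -> float:
    return (y - p) ** 2

def log_loss(y: int, p: float) -> float:
    """Negative log-likelihood of a single Bernoulli observation."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

y = 1  # the true class
for p in (0.9, 0.5, 0.1, 0.01, 0.001):
    print(f"p={p:<6}  squared error={squared_error(y, p):.4f}  "
          f"log loss={log_loss(y, p):.4f}")
```

As \(p \to 0\) for a positive example, the squared error saturates just below 1 while the log loss keeps growing, so confidently wrong predictions are punished far more heavily.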


## 7. Probability of an Event

For binary outcome:
\[
y \in \{0, 1\}
\]

Probability of observing \(y\):
\[
P(y|p) = p^y (1-p)^{(1-y)}
\]

This formulation unifies both cases:
- If \(y=1\) → probability = \(p\)
- If \(y=0\) → probability = \(1-p\)


## 8. Likelihood Function

For multiple independent observations:
\[
L = \prod_{i=1}^{m} p_i^{y_i}(1-p_i)^{(1-y_i)}
\]

Goal:
- Choose parameters that **maximize likelihood**
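A short sketch of what "maximize likelihood" means: evaluate \(L\) for two candidate parameter settings and prefer the one with the larger value. The toy data and both candidate weight pairs are hypothetical.

```python
import math

# Hypothetical toy data: (hours studied, passed?)
data = [(1, 0), (2, 0), (3, 0), (4, 1), (5, 1), (6, 1)]

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def likelihood(w1: float, w0: float, data) -> float:
    """Product over observations of p_i^y_i * (1 - p_i)^(1 - y_i)."""
    L = 1.0
    for x, y in data:
        p = sigmoid(w1 * x + w0)
        L *= (p ** y) * ((1 - p) ** (1 - y))
    return L

uninformed = likelihood(0.0, 0.0, data)    # every p_i is 0.5
better = likelihood(2.0, -7.0, data)       # boundary at 3.5 hours

print(f"L(w1=0, w0=0)  = {uninformed:.6f}")
print(f"L(w1=2, w0=-7) = {better:.6f}")
```

The second setting assigns high probability to the labels that were actually observed, so its likelihood is larger. Fitting the model means searching for the weights where this value peaks.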


## 9. Log-Likelihood

Applying log:
\[
\log L = \sum_{i=1}^{m} \left[
y_i \log(p_i) + (1-y_i)\log(1-p_i)
\right]
\]

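Why log?
- Converts product → sum
- Avoids numerical underflow: a product of many probabilities in (0, 1) quickly rounds to zero
- Easier optimization: sums are simpler to differentiate, and log is monotonic, so the maximizer is unchanged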
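The numerical argument can be checked directly. In this sketch, 2000 hypothetical observations each contribute a probability of 0.5; the raw product underflows in double precision while the sum of logs stays well within range.

```python
import math

# Hypothetical: 2000 observations, each with per-example probability 0.5
probs = [0.5] * 2000

product = 1.0
for p in probs:
    product *= p          # eventually drops below the smallest positive double

log_sum = sum(math.log(p) for p in probs)

print("raw likelihood:", product)
print("log-likelihood:", log_sum)
```

The raw likelihood is reported as zero, which would make every parameter setting look equally bad, whereas the log-likelihood remains a usable finite number.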


## 10. Negative Log Likelihood (NLL)

Optimization objective:
\[
\min -\sum_{i=1}^{m}
\left[
y_i \log(p_i) + (1-y_i)\log(1-p_i)
\right]
\]

This is called:
- Binary Cross Entropy (BCE)
- Log Loss
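A minimal sketch of the loss as a function follows. Two choices here are assumptions rather than part of the formula above: the result is averaged over the \(m\) examples (a common convention that does not change the minimizer), and probabilities are clipped by a small `eps` so that `log(0)` is never evaluated.

```python
import math

def binary_cross_entropy(y_true, p_hat, eps: float = 1e-15) -> float:
    """Mean negative log-likelihood of binary labels given probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_hat):
        p = min(max(p, eps), 1 - eps)   # keep p away from exactly 0 or 1
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

y_true = [1, 0, 1, 0]                       # hypothetical labels
good = [0.9, 0.1, 0.8, 0.2]                 # mostly right, fairly confident
bad = [0.4, 0.6, 0.3, 0.7]                  # mostly wrong

print("BCE (good predictions):", binary_cross_entropy(y_true, good))
print("BCE (bad predictions): ", binary_cross_entropy(y_true, bad))
```

A model that always predicts 0.5 scores \(\log 2 \approx 0.693\), which is a handy baseline when reading log loss values.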


## 11. Solving Logistic Regression

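Parameters are estimated by **Maximum Likelihood Estimation (MLE)**, which is equivalent to minimizing the NLL. Unlike linear regression, there is no closed-form solution, so the optimum is found iteratively:
- Gradient Descent (and its variants)
- Second-order or quasi-Newton methods (e.g. Newton's method, L-BFGS)

In practice:
- `sklearn` uses numerical optimization solvers (e.g. `lbfgs`)
- Deep Learning models use gradient descent, with gradients computed by backpropagation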
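For a feel of the iterative fit, here is a from-scratch sketch of batch gradient descent on the mean NLL for a single feature. It relies on the fact that the gradient of the per-example loss with respect to \(z\) is \(p - y\). The toy data, learning rate, and epoch count are hypothetical, and this is not how `sklearn` solves the problem internally.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic_gd(xs, ys, lr: float = 0.1, epochs: int = 5000):
    """Fit p = sigmoid(w1 * x + w0) by batch gradient descent on mean NLL."""
    w1, w0 = 0.0, 0.0
    m = len(xs)
    for _ in range(epochs):
        grad_w1 = grad_w0 = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w1 * x + w0) - y   # d(loss_i)/dz = p_i - y_i
            grad_w1 += err * x
            grad_w0 += err
        w1 -= lr * grad_w1 / m
        w0 -= lr * grad_w0 / m
    return w1, w0

# Hypothetical, deliberately overlapping data (hours studied, passed?)
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

w1, w0 = fit_logistic_gd(xs, ys)
print(f"w1={w1:.3f}  w0={w0:.3f}  boundary at hours={-w0 / w1:.2f}")
```

The classes overlap on purpose: with perfectly separable data the likelihood has no finite maximizer and the weights would keep growing unless regularization is added.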


## 12. Assumptions of Logistic Regression

1. Binary target variable  
2. Independent observations  
3. Linear relationship between features and log-odds  
4. Little or no multicollinearity among features  
5. No highly influential outliers  
6. Large sample size preferred  


## 13. Handling Multicollinearity & Feature Selection

- Use VIF to detect multicollinearity
- Feature selection:
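  - Manual (keep features whose coefficient p-values are below 0.05, a common convention)
  - Automatic (Recursive Feature Elimination, RFE)

After fitting:
- If training performance is much higher than test performance → overfitting
- Apply regularization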
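The Variance Inflation Factor for feature \(j\) is \(1 / (1 - R_j^2)\), where \(R_j^2\) comes from regressing that feature on all the others. The sketch below covers only the two-feature special case, where \(R_j^2\) equals the squared Pearson correlation; the feature names and values are hypothetical. A full implementation would regress each feature on all remaining ones.

```python
def pearson_r(a, b) -> float:
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def vif_two_features(a, b) -> float:
    """VIF when the model has exactly two features: 1 / (1 - r^2)."""
    r2 = pearson_r(a, b) ** 2
    return 1.0 / (1.0 - r2)

# Hypothetical features
hours_studied = [1, 2, 3, 4, 5, 6, 7, 8]
hours_revised = [1.1, 1.9, 3.2, 3.8, 5.1, 6.2, 6.8, 8.1]   # nearly a copy
sleep_hours = [7, 6, 8, 7, 6, 8, 7, 7]                     # mostly unrelated

print("VIF (studied vs revised):", vif_two_features(hours_studied, hours_revised))
print("VIF (studied vs sleep):  ", vif_two_features(hours_studied, sleep_hours))
```

A VIF near 1 means the feature carries independent information. Values above roughly 5 to 10 are a commonly used rule of thumb for problematic collinearity, though the cutoff is a convention rather than a hard rule.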


## 14. Regularization

- L1 (Lasso):
  - Can shrink some weights to exactly zero (implicit feature elimination)
- L2 (Ridge):
  - Shrinks weights toward zero without eliminating them
- Elastic Net:
  - Combination of L1 & L2

Used to:
- Reduce overfitting
- Improve generalization
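
Written as objectives, each penalty is added to the negative log-likelihood from Section 10. The notation is an assumption introduced here: \(\lambda \ge 0\) is the regularization strength, \(\alpha \in [0, 1]\) is the L1/L2 mix, and the intercept \(w_0\) is conventionally left unpenalized.

L1:
\[
J_{L1}(w) = \text{NLL}(w) + \lambda \sum_{j=1}^{n} |w_j|
\]

L2:
\[
J_{L2}(w) = \text{NLL}(w) + \lambda \sum_{j=1}^{n} w_j^2
\]

Elastic Net:
\[
J_{EN}(w) = \text{NLL}(w) + \lambda \left[ \alpha \sum_{j=1}^{n} |w_j| + (1-\alpha) \sum_{j=1}^{n} w_j^2 \right]
\]

Setting \(\alpha = 1\) recovers L1 and \(\alpha = 0\) recovers L2. Libraries differ in how they scale these terms, so \(\lambda\) values are not directly comparable across implementations.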


## 15. Introduction to Support Vector Machines (SVM)

Key idea:
- Find the **best separating hyperplane**
- Maximize margin between classes

SVM focuses on:
- Boundary points (support vectors)
- Margin maximization
- A decision boundary determined by the support vectors, so points far from the margin do not affect it
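
To illustrate the geometry, the sketch below takes a hyperplane \(w \cdot x + b = 0\) as given and measures how far each point lies from it. The hyperplane and the points are hypothetical and no SVM is trained here; the code only shows what "margin" and "support vectors" refer to.

```python
import math

# Hypothetical separating hyperplane w . x + b = 0 in 2D
w = (1.0, 1.0)
b = -5.0

# Hypothetical labelled points: (coordinates, class in {-1, +1})
points = [((1.0, 1.0), -1), ((2.0, 2.0), -1), ((1.0, 2.0), -1),
          ((3.0, 3.0), +1), ((4.0, 4.0), +1)]

def geometric_margin(x, y, w, b) -> float:
    """Signed distance to the hyperplane, positive if x is on its own side."""
    norm = math.hypot(w[0], w[1])
    return y * (w[0] * x[0] + w[1] * x[1] + b) / norm

margins = [geometric_margin(x, y, w, b) for x, y in points]
smallest = min(margins)

# Points that sit at the smallest distance are the ones defining the margin
support_vectors = [x for (x, _), m in zip(points, margins)
                   if math.isclose(m, smallest)]

print("distance of closest points:", smallest)
print("margin width:", 2 * smallest)
print("support vectors:", support_vectors)
```

Moving any of the other points further away from the boundary leaves the margin unchanged, which is the sense in which only the support vectors matter.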
