
# LOGISTIC REGRESSION


[LOGISTIC REGRESSION OVERVIEW](https://colab.research.google.com/drive/1rhDM8FPZFLhMkj6Yo0xeRjnCvGXI_pf3#scrollTo=1WUkAZJzubtP&line=1&uniqifier=1)

10.1. [INTRODUCTION](https://colab.research.google.com/drive/1rhDM8FPZFLhMkj6Yo0xeRjnCvGXI_pf3#scrollTo=eMDc_ZO-xgni&line=1&uniqifier=1)

10.2. [Logistic Regression: The Math Behind the model](https://colab.research.google.com/drive/1rhDM8FPZFLhMkj6Yo0xeRjnCvGXI_pf3#scrollTo=T6PKwUll25Ec&line=1&uniqifier=1)

10.3. [Example: Acceptance of Personal Loan](https://colab.research.google.com/drive/1rhDM8FPZFLhMkj6Yo0xeRjnCvGXI_pf3#scrollTo=mT1o6tNJ7PEc&line=68&uniqifier=1)

10.4. [Evaluating Classification Performance](https://colab.research.google.com/drive/1rhDM8FPZFLhMkj6Yo0xeRjnCvGXI_pf3#scrollTo=fQLDXLQIFid_&line=46&uniqifier=1)

10.5. [Multi-class Logistic Regression](https://colab.research.google.com/drive/1rhDM8FPZFLhMkj6Yo0xeRjnCvGXI_pf3#scrollTo=4jTDgoJHHQf8&line=1&uniqifier=1)

## LOGISTIC REGRESSION OVERVIEW


This section introduces **Logistic Regression**, a highly popular and powerful method used for classification tasks.

* **Methodology & Setup**
    * It models the relationship between **predictors** and a specific **outcome**, similar to linear regression.
    * Users must explicitly specify predictors and their forms (e.g., interaction terms).

* **Key Advantages**
    * a. Effective even on **small datasets**.
    * b. **Computationally efficient**: Once the model is estimated, classifying large samples of new records is fast and cheap.

* **Core Concepts & Estimation**
    * Focuses on model formulation and estimation from data.
    * Explains the fundamental relations between **"logit"**, **"odds"**, and **"probability"** of an event.

* **Advanced Topics**
    * a. **Variable importance** and **coefficient interpretation**.
    * b. **Variable selection** techniques for **dimension reduction**.
    * c. Extensions to **multi-class classification** problems.

### Python

In this chapter, we will use `pandas` for data handling, `scikit-learn` and `statsmodels` for the models, and `matplotlib` for visualization. We will also make use of the utility functions from the Python Utilities Functions Appendix. Use the following import statements for the Python code in this chapter.

```python
# Install the extra packages (notebook shell command; run once per session)
!pip install mord dmba

# import required functionality for this chapter
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
from mord import LogisticIT
import matplotlib.pyplot as plt
import seaborn as sns
from dmba import classificationSummary, gainsChart, liftChart
from dmba.metric import AIC_score
```



## 10.1. INTRODUCTION






This section details the scope, application, and mechanics of the Logistic Regression model.

* **Core Concept & Application**
    * a. **Purpose**: Extends linear regression to handle **categorical outcomes** (classes) rather than continuous values.
    * b. **Primary Uses**:
        * **Classification**: Predicting the class of a new record based on predictors.
        * **Profiling**: Identifying factors that distinguish between classes in known data.
    * c. **Target Variable**: Focuses primarily on **binary outcomes** (e.g., Success/Failure, 0/1). Continuous variables are sometimes converted (binned) into binary classes for simplification.

* **Mechanism: Propensities and Cutoffs**
    Logistic regression operates in two distinct steps to classify records:
    * a. **Step 1: Estimation**: The model calculates the **propensity** (or probability) that a record belongs to the class of interest, denoted as $p = P(Y=1)$.
    * b. **Step 2: Classification via Cutoff**:
        * A **cutoff value** is applied to the estimated probabilities to assign classes.
        * **Standard Rule**: Typically, if $P(Y=1) \ge 0.5$, the record is classified as Class 1.
        * **Adjustments**: For rare but critical events (e.g., fraud), the cutoff may be lowered to capture more Class 1 cases.
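
The two-step mechanism above can be sketched in a few lines. The propensities below are made-up illustrative values rather than output from a fitted model, and `classify` is a hypothetical helper name.

```python
def classify(propensities, cutoff=0.5):
    """Assign class 1 when the estimated propensity meets or exceeds the cutoff."""
    return [1 if p >= cutoff else 0 for p in propensities]

# Hypothetical propensities P(Y=1) for five records (illustrative values only)
propensities = [0.05, 0.30, 0.48, 0.62, 0.91]

default_classes = classify(propensities)               # standard 0.5 cutoff
lowered_classes = classify(propensities, cutoff=0.25)  # lower cutoff for a rare class

print(default_classes)
print(lowered_classes)
```

Lowering the cutoff moves more records into Class 1, which is the trade-off described for rare but critical events.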

## 10.2. Logistic Regression: The Math Behind the Model




This section explains the mathematical formulation linking predictors to the probability of an outcome.

* **The Limitation of Linear Regression**
    * Standard linear regression cannot be used directly for classification because it may predict values outside the required probability range of [0, 1].
    * **Solution**: Use a nonlinear function (Logistic Response Function) to ensure predictions stay within [0, 1].

* **Key Concepts & Relationships**
    * a. **Probability ($p$)**:
        * The probability of belonging to class 1: $p = P(Y=1)$.
        * **Range**: $[0, 1]$.
        * **Formula**: $p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_q x_q)}}$
    * b. **Odds**:
        * The ratio of the probability of the event happening to it *not* happening.
        * **Formula**: $\text{Odds}(Y=1) = \frac{p}{1-p}$
        * **Range**: $[0, \infty)$ (from 0 to infinity).
        * **Relationship with predictors**: Multiplicative (exponential).
    * c. **Logit (Log-Odds)**:
        * The natural logarithm of the odds.
        * **Formula**: $\text{logit} = \log(\text{Odds}) = \beta_0 + \beta_1 x_1 + \dots + \beta_q x_q$
        * **Range**: $(-\infty, +\infty)$.
        * **Relationship with predictors**: **Linear**. The coefficients enter the logit exactly as they do in a linear regression equation, although they are estimated by maximum likelihood rather than least squares.

* **Summary of Transformation Steps**
    1.  Predictors ($X$) $\rightarrow$ Linear Equation $\rightarrow$ **Logit**
    2.  **Logit** $\rightarrow$ Exponentiation ($e^{logit}$) $\rightarrow$ **Odds**
    3.  **Odds** $\rightarrow$ Mapping ($\frac{Odds}{1+Odds}$) $\rightarrow$ **Probability ($p$)**
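
The three transformation steps can be traced numerically. This is a minimal sketch with hypothetical coefficients (`beta0`, `beta1`) and a hypothetical predictor value `x`; the helper function names are illustrative.

```python
import math

def logit_to_odds(logit):
    """Step 2: exponentiate the logit to obtain the odds."""
    return math.exp(logit)

def odds_to_probability(odds):
    """Step 3: map odds in [0, inf) to a probability in [0, 1)."""
    return odds / (1.0 + odds)

def probability_to_logit(p):
    """Inverse direction: log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients and predictor value (not estimated from data)
beta0, beta1 = -2.0, 0.5
x = 6.0

logit = beta0 + beta1 * x        # Step 1: linear equation
odds = logit_to_odds(logit)
p = odds_to_probability(odds)

# The logistic response function gives the same probability in one step
p_direct = 1.0 / (1.0 + math.exp(-logit))
print(logit, odds, p, p_direct)
```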

## 10.3 Example: Acceptance of Personal Loan
Logistic Regression: Personal Loan Acceptance Case Study

### **1. Context & Problem Definition**

a. **Data Context:** The Universal Bank dataset contains 5000 customer records.

b. **Target Variable ($Y$):** `Personal Loan` (Binary: Did the customer accept the loan offer in the last campaign?).

c. **Statistics:** Only 480 customers (9.6%) accepted the loan (imbalanced classes).

d. **Objective:** Build a **classification model** to identify customers most likely to accept a loan offer in future campaigns.

---

### **2. Model with a Single Predictor**

a. **Concept:** Similar to simple linear regression, but the outcome variable $Y$ is categorical.

b. **Predictor ($X$):** Using `Income` to classify customers.

c. **Probability Formula ($P$):** The probability of accepting a loan given income $x$ is calculated using the Logistic function (Sigmoid):
   $$P(\text{Personal Loan} = \text{Yes} \mid \text{Income} = x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$

d. **Odds Formula:**
   $$\text{Odds}(\text{Personal Loan} = \text{Yes} \mid \text{Income} = x) = e^{\beta_0 + \beta_1 x}$$


e. **Classification Mechanism:**
   - The model outputs a probability $p$ between 0 and 1.
   - A **cutoff value** is applied to classify the result as 1 (Accept) or 0 (Reject). If $p > \text{cutoff}$, predict 1.


---

### **3. Estimating the Model (MLE vs. Least Squares)**

a. **Methodology:** Unlike Linear Regression which uses *Least Squares*, Logistic Regression uses **Maximum Likelihood Estimation (MLE)**.

b. **Reasoning:** The relationship between the target $Y$ and parameters $\beta$ is non-linear.

c. **MLE Principle:** Finds the parameters that maximize the chance (likelihood) of obtaining the observed data.
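
To make the MLE principle concrete, here is a minimal sketch that maximizes the log-likelihood of a single-predictor model by plain gradient ascent. The data are a small toy sample (not the Universal Bank data), and gradient ascent is a stand-in chosen for readability; libraries such as `statsmodels` and `scikit-learn` use more sophisticated optimizers.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(b0, b1, xs, ys):
    """Sum of log P(observed y | x) under the logistic model."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def fit_by_gradient_ascent(xs, ys, rate=0.1, steps=5000):
    """Climb the log-likelihood surface starting from b0 = b1 = 0."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of the log-likelihood: residuals (y - p), weighted by 1 and x
        g0 = sum(y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys))
        g1 = sum((y - sigmoid(b0 + b1 * x)) * x for x, y in zip(xs, ys))
        b0 += rate * g0 / n
        b1 += rate * g1 / n
    return b0, b1

# Toy data: the classes overlap, so the maximum likelihood estimate is finite
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1 = fit_by_gradient_ascent(xs, ys)
print("start log-likelihood:", log_likelihood(0.0, 0.0, xs, ys))
print("fitted log-likelihood:", log_likelihood(b0, b1, xs, ys))
```

The fitted coefficients give a higher log-likelihood than the starting values, which is all "maximizing the likelihood" means here.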


---


### **4. Data Preprocessing**

a. **Categorical Variables:**
   - Variables like `Education` (levels 1, 2, 3) must be converted into **dummy variables** (one-hot encoding).
   - **Multicollinearity Note:** To avoid perfect multicollinearity, keep only $k-1$ dummy variables for $k$ categories.

b. **Data Splitting:**
   - **Training set:** 60% (for model fitting).
   - **Validation set:** 40% (for performance evaluation).
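
A minimal sketch of both preprocessing steps with `pandas` follows. The small data frame is an illustrative stand-in for the bank data, and `DataFrame.sample` is used here for the 60/40 partition as a simple substitute; `train_test_split` from `scikit-learn` would serve the same purpose.

```python
import pandas as pd

# Toy frame standing in for the bank data (values are illustrative only)
df = pd.DataFrame({
    "Income": [49, 34, 11, 100, 45, 29, 72, 22, 81, 180],
    "Education": [1, 1, 1, 2, 2, 2, 2, 3, 2, 3],
    "Personal Loan": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
})

# Treat Education as categorical, then keep k-1 = 2 dummies (level 1 is the reference)
df["Education"] = df["Education"].astype("category")
X = pd.get_dummies(df.drop(columns="Personal Loan"), prefix_sep="_", drop_first=True)
y = df["Personal Loan"]

# 60% training / 40% validation partition
train_X = X.sample(frac=0.6, random_state=1)
valid_X = X.drop(train_X.index)
train_y, valid_y = y.loc[train_X.index], y.loc[valid_X.index]

print(list(X.columns))
print(len(train_X), len(valid_X))
```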


---

### **5. Estimated Model & Logit Equation**

a. **The Logit:** The natural logarithm of the odds, $\ln(\text{Odds})$. It has a **linear relationship** with the predictors, allowing us to see the additive effect of each variable.

b. **Estimated Equation (12 Predictors):**
   $$
   \begin{aligned}
   \text{Logit}(\text{Personal Loan} = \text{Yes}) = & -12.619 - 0.0325(\text{Age}) + 0.0342(\text{Experience}) \\
   & + 0.0588(\text{Income}) + 0.6141(\text{Family}) + 0.2405(\text{CCAvg}) \\
   & + 0.0010(\text{Mortgage}) - 1.0262(\text{Securities\_Account}) \\
   & + 3.6479(\text{CD\_Account}) - 0.6779(\text{Online}) - 0.9560(\text{Credit Card}) \\
   & + 4.1922(\text{Education\_Graduate}) \\
   & + 4.3417(\text{Education\_Advanced/Professional})
   \end{aligned}
   $$

c. **Interpretation:**
   - **Positive coefficients** (e.g., `Income`, `CD_Account`) increase the probability of loan acceptance.
   - **Negative coefficients** (e.g., `Credit Card`, `Online`) decrease the probability.
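
As an illustration, the estimated equation can be evaluated for one customer. The coefficients are copied from the equation above; the customer profile is entirely hypothetical and assumes the same units as the dataset.

```python
import math

# Coefficients copied from the estimated logit equation
intercept = -12.619
coefficients = {
    "Age": -0.0325, "Experience": 0.0342, "Income": 0.0588,
    "Family": 0.6141, "CCAvg": 0.2405, "Mortgage": 0.0010,
    "Securities_Account": -1.0262, "CD_Account": 3.6479,
    "Online": -0.6779, "CreditCard": -0.9560,
    "Education_Graduate": 4.1922,
    "Education_Advanced/Professional": 4.3417,
}

# Hypothetical customer (illustrative values, not a record from the data)
customer = {
    "Age": 40, "Experience": 15, "Income": 100, "Family": 2, "CCAvg": 2.0,
    "Mortgage": 0, "Securities_Account": 0, "CD_Account": 0, "Online": 1,
    "CreditCard": 0, "Education_Graduate": 1,
    "Education_Advanced/Professional": 0,
}

# Logit is the intercept plus the sum of coefficient * predictor value
logit = intercept + sum(coefficients[name] * customer[name] for name in coefficients)
probability = 1.0 / (1.0 + math.exp(-logit))

print(f"logit = {logit:.4f}, propensity = {probability:.4f}")
```

With a 0.5 cutoff this hypothetical customer would be classified as a non-acceptor, since the propensity is well below the cutoff.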


---

### **6. Interpreting Results via Odds Ratios**

a. **Odds Ratio Formula:**
   $$\text{Odds Ratio} = e^{\beta_1}$$

b. **Meaning:** $e^{\beta_1}$ is the factor by which the odds are multiplied when $X_1$ increases by 1 unit (holding other variables constant).
   - If $\beta_1 > 0$: $e^{\beta_1} > 1$ (Odds increase).
   - If $\beta_1 < 0$: $e^{\beta_1} < 1$ (Odds decrease).

c. **Examples:**
   - **Income ($\beta \approx 0.0588$):** A 1-unit increase in Income multiplies the odds of acceptance by $e^{0.0588} \approx 1.06$, holding the other predictors constant.
   - **CD_Account ($\beta \approx 3.65$):** Customers with a CD Account have odds of acceptance approximately **38.4 times** ($e^{3.6479}$) those of customers without one.
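
The conversion from coefficients to odds ratios is a single exponentiation. A short sketch, using three coefficients taken from the estimated logit equation in Section 5:

```python
import math

# Coefficients from the estimated logit equation (Section 5)
coefficients = {"Income": 0.0588, "CD_Account": 3.6479, "Online": -0.6779}

# Odds ratio = e^beta: values above 1 raise the odds, values below 1 lower them
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    direction = "increase" if ratio > 1 else "decrease"
    print(f"{name}: odds ratio = {ratio:.3f} ({direction} in odds per unit)")
```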

## 10.4 Evaluating Classification Performance

### 1. Evaluating Classification Performance
To assess how well a Logistic Regression model performs, we use several metrics and visualizations beyond simple accuracy.

* **Key Metrics:**
    * **Confusion Matrix:** A table showing True Positives, True Negatives, False Positives, and False Negatives.
    * **Accuracy:** The overall percentage of correct predictions.
    * **Ranking Goal:** In many business cases (e.g., credit scoring), ranking customers by their probability of belonging to a class is more important than just classifying them.

* **Visualizations:**
    * **Gains Chart & Lift Chart:** These evaluate the model's ability to identify targets compared to random selection.
        * *Interpretation:* Lift measures how many times better the model is at finding the target class than random selection. For example, a lift of 7.8 in the top decile means that the 10% of customers with the highest propensities contain 7.8 times as many actual responders as a random 10% sample.
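
The metrics above can be computed by hand on a small example. This sketch uses 20 made-up records with illustrative propensities; `confusion_counts` and `top_fraction_lift` are hypothetical helper names, not functions from `dmba` or `scikit-learn`.

```python
def confusion_counts(actual, propensity, cutoff=0.5):
    """Count TP, FP, TN, FN after applying the cutoff to the propensities."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for y, p in zip(actual, propensity):
        predicted = 1 if p >= cutoff else 0
        if predicted == 1:
            counts["TP" if y == 1 else "FP"] += 1
        else:
            counts["FN" if y == 1 else "TN"] += 1
    return counts

def top_fraction_lift(actual, propensity, fraction=0.1):
    """Response rate among the top-ranked records divided by the overall rate."""
    ranked = sorted(zip(propensity, actual), reverse=True)
    n_top = max(1, round(len(ranked) * fraction))
    top_rate = sum(y for _, y in ranked[:n_top]) / n_top
    overall_rate = sum(actual) / len(actual)
    return top_rate / overall_rate

# Toy validation results: 20 records, 4 actual responders (illustrative only)
actual = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
propensity = [0.95, 0.90, 0.80, 0.70, 0.60, 0.45, 0.40, 0.35, 0.30, 0.25,
              0.20, 0.18, 0.15, 0.12, 0.10, 0.08, 0.06, 0.05, 0.03, 0.01]

counts = confusion_counts(actual, propensity)
accuracy = (counts["TP"] + counts["TN"]) / len(actual)
decile_lift = top_fraction_lift(actual, propensity, fraction=0.1)

print(counts, accuracy, decile_lift)
```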

---

### 2. Interpreting Model Output
Understanding the relationship between predictors and the outcome is crucial.

* **Coefficients ($\beta$):** Represent the change in the **Logit** (log-odds) for a one-unit increase in the predictor.
    * Positive $\beta$: Increases the probability of the event.
    * Negative $\beta$: Decreases the probability of the event.
* **Odds Ratios (O.R. = $e^{\beta}$):** A more intuitive measure.
    * *Interpretation:* If O.R. = 1.05, it means a one-unit increase in the predictor increases the odds of the event by 5%.
    * If O.R. > 1: Positive relationship.
    * If O.R. < 1: Negative relationship.
* **P-values:** Determine statistical significance. A low p-value (typically < 0.05) indicates the predictor is significantly related to the outcome (e.g., *Income* and *Education* in the bank loan example).

---

### 3. Variable Selection & Model Validation
Finding the right balance between model simplicity (parsimony) and accuracy.

* **Selection Methods:**
    * **Automated Heuristics:** Stepwise, Forward Selection, and Backward Elimination (often minimizing **AIC**).
    * **Regularization:** Using **L1 (Lasso)** or **L2 (Ridge)** penalties to prevent overfitting. In scikit-learn, the penalty strength is controlled by the `C` parameter of `LogisticRegression`, which is the inverse of regularization strength (a smaller `C` means a stronger penalty).
    * **Interaction Terms:** Adding terms like $\text{Income} \times \text{Family}$ when variables have a combined effect.

* **Profiling via Deciles:**
    * Analyzing the characteristics of the "Top Decile" (top 10% highest probability) vs. the overall average helps build a profile of the target audience (e.g., "Targets have higher income and education").

* **CRITICAL NOTE: The Danger of "Overly Optimistic" Performance**
    * **The Issue:** Relying solely on **Validation Data** for performance evaluation can be misleading. Since validation data is used to *select* the best model (tuning), the model implicitly "learns" the specific noise of the validation set.
    * **The Solution:** Always reserve a separate **Test Set** (Unseen Data) that is never used during the training or model selection process. This provides an unbiased estimate of how the model will perform in the real world.
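
A three-way partition can be sketched with the standard library alone. The 50/30/20 proportions and the helper name `three_way_split` are assumptions made for illustration; the chapter's example itself uses a 60/40 training/validation split.

```python
import random

def three_way_split(records, train=0.5, valid=0.3, seed=1):
    """Shuffle once, then cut into training, validation, and test partitions."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * valid)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])

# Record ids standing in for the 5000 customer records
record_ids = range(5000)
train_ids, valid_ids, test_ids = three_way_split(record_ids)

# Fit on train_ids, compare candidate models on valid_ids,
# and report final performance once on test_ids.
print(len(train_ids), len(valid_ids), len(test_ids))
```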



## 10.5 Multi-class Logistic Regression


When the target variable has more than two classes ($m > 2$), the binary logistic model is extended. Since the class probabilities must sum to 1, we only need to estimate $m-1$ of them; the last one follows by subtraction.

### 1. Ordinal Classes (Ordered Categories)
Used when classes have a meaningful order (e.g., *Buy, Hold, Sell* or *Low, Medium, High*). The method is often called **Cumulative Logit** or **Proportional Odds**.

* **Key Concept:** Model the cumulative probability $P(Y \le j)$.
* **Assumption:** The predictors have the **same slope ($\beta_1$)** in every cumulative logit, but each cumulative logit has its own intercept ($\alpha_0$ and $\beta_0$ in the formulas below).

**Formulas (Example with $m=3$ classes):**
The logit functions for the cumulative probabilities are:
$$
\text{logit}(Y \le 1) = \ln \left( \frac{P(Y \le 1)}{1 - P(Y \le 1)} \right) = \alpha_0 + \beta_1 x
$$
$$
\text{logit}(Y \le 2) = \ln \left( \frac{P(Y \le 2)}{1 - P(Y \le 2)} \right) = \beta_0 + \beta_1 x
$$

**Recovering Probabilities:**
$$
P(Y=1) = \frac{1}{1 + e^{-(\alpha_0 + \beta_1 x)}}
$$


$$
P(Y=2) = P(Y \le 2) - P(Y \le 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}} - P(Y=1)
$$




$$
P(Y=3) = 1 - P(Y \le 2)
$$
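
The recovery of class probabilities from the two cumulative logits can be checked numerically. The coefficient values below are hypothetical, chosen so that $\alpha_0 < \beta_0$ and the cumulative probabilities increase with the class level.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ordinal_probabilities(x, alpha0, beta0, beta1):
    """Class probabilities for m = 3 ordered classes under the cumulative logit model."""
    cum1 = sigmoid(alpha0 + beta1 * x)   # P(Y <= 1)
    cum2 = sigmoid(beta0 + beta1 * x)    # P(Y <= 2)
    return cum1, cum2 - cum1, 1.0 - cum2

# Hypothetical coefficients: shared slope beta1, separate intercepts alpha0 < beta0
p1, p2, p3 = ordinal_probabilities(x=1.0, alpha0=-1.0, beta0=1.0, beta1=0.5)
print(p1, p2, p3)
```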

---

### 2. Nominal Classes (Unordered Categories)
Used when classes have no intrinsic order (e.g., *Brand A, Brand B, Brand C*). We use **Multinomial Logistic Regression**.

* **Key Concept:** Select one class as the **Reference Class** (e.g., Class C). Model the log-odds of membership in other classes relative to the reference.
* **Assumption:** Each class comparison has its **own unique set of coefficients** (different slopes and intercepts).

**Formulas (Example with Reference = C):**
The "pseudo-logit" equations are:
$$
\text{logit}(A) = \ln \left( \frac{P(Y=A)}{P(Y=C)} \right) = \alpha_0 + \alpha_1 x
$$
$$
\text{logit}(B) = \ln \left( \frac{P(Y=B)}{P(Y=C)} \right) = \beta_0 + \beta_1 x
$$

**Recovering Probabilities (Softmax):**
$$
P(Y=A) = \frac{e^{\alpha_0 + \alpha_1 x}}{1 + e^{\alpha_0 + \alpha_1 x} + e^{\beta_0 + \beta_1 x}}
$$
$$
P(Y=B) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\alpha_0 + \alpha_1 x} + e^{\beta_0 + \beta_1 x}}
$$
$$
P(Y=C) = 1 - P(Y=A) - P(Y=B)
$$
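
The same recovery for the nominal case, as a minimal sketch with hypothetical coefficients and class C as the reference:

```python
import math

def nominal_probabilities(x, alpha0, alpha1, beta0, beta1):
    """Class probabilities for classes A, B, and reference class C."""
    score_a = math.exp(alpha0 + alpha1 * x)   # odds of A relative to C
    score_b = math.exp(beta0 + beta1 * x)     # odds of B relative to C
    denominator = 1.0 + score_a + score_b     # the 1 is the reference class C
    p_a = score_a / denominator
    p_b = score_b / denominator
    return p_a, p_b, 1.0 - p_a - p_b

# Hypothetical coefficients: each non-reference class has its own intercept and slope
p_a, p_b, p_c = nominal_probabilities(x=2.0, alpha0=1.0, alpha1=-0.25,
                                      beta0=-1.0, beta1=0.25)
print(p_a, p_b, p_c)

# The log of each probability ratio recovers the corresponding pseudo-logit
print(math.log(p_a / p_c), math.log(p_b / p_c))
```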



### Summary Table: Ordinal vs. Nominal

| Feature | Ordinal Logistic Regression | Nominal Logistic Regression |
| :--- | :--- | :--- |
| **Use Case** | Ranked data (Severity, Ratings) | Unordered data (Brands, Types) |
| **Slopes ($\beta$)** | **Shared** (Parallel lines assumption) | **Separate** for each class |
| **Intercepts ($\alpha$)** | Separate | Separate |
| **Complexity** | More parsimonious (fewer parameters) | More complex (more parameters) |

## 10.6. Example of complete analysis: predicting delayed flights

