## logistic regression 

 logistic regression multi-dimensional data
 
 
 $$ F(X)=X \times W $$
 $$ H(x)= \frac{1}{1+ e ^{-F(x)}} $$
 $$ C= -\frac{1}{n} \sum_{i,j} (Y \odot log(H(x)) + (1-Y) \odot log(1-H(x)) ) $$

$X_{n \times k}$

$W_{k \times p}$

$Y_{n \times p}$

# Logistic Regression

Logistic regression is a statistical model used for **binary classification** (can be extended to multi-class). It predicts the **probability** that an instance belongs to a particular class.

Despite its name, it is a **classification algorithm**, not regression. The "regression" part refers to using a regression-like approach to estimate probabilities.



### 1. Core Mathematical Concepts

#### Logit Function (Log-Odds)
The "Logit" is the natural log of the odds ratio. It maps probability values from $(0, 1)$ to $(-\infty, +\infty)$.

$$\text{logit}(p) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$$

* **$p$:** Probability of success (class = 1)
* **$1-p$:** Probability of failure (class = 0)
* **$\ln(\frac{p}{1-p})$:** Log-odds

#### Sigmoid Function
The Sigmoid function maps the output of the linear equation ($z$) back to a probability range $(0, 1)$.

$$p = \sigma(z) = \frac{1}{1 + e^{-z}}$$

* **Where:** $z = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$
* **Properties:**
    * **Output range:** $(0, 1)$
    * **Shape:** S-shaped curve
    * **Derivative:** $\sigma'(z) = \sigma(z)(1-\sigma(z))$

---

### 2. Assumptions of Logistic Regression
* **Binary Outcome:** The target variable is binary (for standard logistic regression).
* **Linearity of Log-Odds:** There is a linear relationship between the independent variables and the log-odds of the target.
* **No Multicollinearity:** Independent variables should not be highly correlated with each other.
* **Independence:** Observations must be independent of each other.
* **Large Sample Size:** Rule of thumb is at least 10 events per predictor variable.

---

### 3. Cost Function & Optimization

#### Cost Function: Cross-Entropy Loss (Log Loss)
We cannot use Mean Squared Error (MSE) because it creates a non-convex loss surface (many local minima). Instead, we use **Cross-Entropy**, which is convex.

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1-y^{(i)}) \log(1-h_\theta(x^{(i)})) \right]$$

* $h_\theta(x)$: Predicted probability
* $y$: Actual label (0 or 1)
* $m$: Number of samples

#### Optimization: Gradient Descent
We minimize the Cost Function using Gradient Descent or solvers like `liblinear`/`lbfgs`.

**Update Rule:**
$$\theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}$$

**Gradient Calculation:**
$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$$

---

### 4. Model Evaluation Metrics



[Image of ROC curve classification]


#### Classification Metrics
* **Accuracy:** $\frac{TP+TN}{TP+TN+FP+FN}$
* **Precision:** $\frac{TP}{TP+FP}$ (Focus on minimizing False Positives)
* **Recall (Sensitivity):** $\frac{TP}{TP+FN}$ (Focus on minimizing False Negatives)
* **F1-Score:** $2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$
* **Specificity:** $\frac{TN}{TN+FP}$

#### Probability & Calibration Metrics
* **Log Loss:** Measures the uncertainty/quality of predicted probabilities.
* **ROC-AUC:** Area Under the Receiver Operating Characteristic curve. Measures the model's ability to distinguish between classes.
* **Precision-Recall AUC:** Better for highly imbalanced datasets.

#### Threshold Selection
Default threshold is **0.5**, but it should be tuned based on business needs:
* Use **ROC Curve** to find the optimal balance.
* Use **Youden's J statistic:** $J = \text{Sensitivity} + \text{Specificity} - 1$.
* Consider the **Business Cost** of False Positives vs. False Negatives.

---

### 5. Regularization (Preventing Overfitting)

#### L1 Regularization (Lasso)
Adds the absolute value of magnitude of coefficients as a penalty term.
$$J(\theta) = \text{CrossEntropy} + \lambda \sum |\theta_j|$$
* **Effect:** Creates **sparse models** (Feature Selection). Can zero out coefficients entirely.

#### L2 Regularization (Ridge)
Adds the squared magnitude of coefficients as a penalty term.
$$J(\theta) = \text{CrossEntropy} + \lambda \sum \theta_j^2$$
* **Effect:** Shrinks coefficients toward zero (but not exactly zero). Handles multicollinearity better.

#### Elastic Net
* A linear combination of **L1** and **L2** regularization.

---

### 6. Multiclass Logistic Regression

1.  **One-vs-Rest (OvR):** Trains $k$ binary classifiers (e.g., Red vs. Not Red, Blue vs. Not Blue).
2.  **Multinomial (Softmax) Regression:** Direct generalization using the Softmax function.
    $$P(y=k|x) = \frac{e^{\theta_k \cdot x}}{\sum_{j} e^{\theta_j \cdot x}}$$

---

### 7. Feature Importance & Interpretation

#### Odds Ratio
$$\text{OR} = e^{\beta}$$
* **Interpretation:** For a 1-unit increase in $x$, the odds of the event happening increase by a factor of $e^{\beta}$.

#### Statistical Significance
* **Wald Test:** $z = \frac{\beta}{SE(\beta)}$ (Tests if coefficient $\ne 0$).
* **P-values:** Used to determine statistical significance of features.
* **Confidence Intervals:** Defines the range for odds ratios.

---

### 8. Practical Considerations

#### Handling Categorical Variables
* Use **One-Hot Encoding** (Dummy Variables).
* **Avoid Dummy Variable Trap:** Drop one category (k-1 dummies) to prevent perfect multicollinearity.
* Use **Target/Mean Encoding** for high-cardinality features.

#### Feature Scaling
* **Not strictly required** for the core logic (unlike KNN).
* **Highly Recommended** because:
    1.  It helps Gradient Descent converge much faster.
    2.  It is **Mandatory** if you are using Regularization (L1/L2), otherwise larger features will be penalized unfairly.

#### Checking Assumptions
* **Linearity:** Use Box-Tidwell test or residual plots.
* **Influential Points:** Check Cook's Distance.
* **Multicollinearity:** Check VIF (Variance Inflation Factor).

---

### 9. Advantages vs. Disadvantages

| Advantages | Disadvantages |
| :--- | :--- |
| Outputs probabilities (not just classes). | Assumes a linear decision boundary. |
| Highly interpretable (Odds Ratios). | Sensitive to outliers. |
| Efficient to train. | Requires feature engineering for non-linear relationships. |
| Works well with small datasets. | Can underperform on complex patterns compared to Trees/NNs. |
| Less prone to overfitting (with Regularization). | |

# Logistic Regression

Despite its name, **Logistic Regression** is a **Classification** algorithm, not a regression algorithm. It is used to predict the probability of a target variable belonging to a certain class (e.g., 0 or 1, Yes or No).



### Core Concept: The Sigmoid Function
Linear Regression fits a straight line that can go from $-\infty$ to $+\infty$. This doesn't work for probability (which must be between 0 and 1).

Logistic Regression solves this by squashing the output of a linear equation into an **S-shaped curve** using the **Sigmoid Function**.

* **Linear Equation:** $z = w \cdot x + b$
* **Sigmoid Transformation:** $\sigma(z) = \frac{1}{1 + e^{-z}}$

**Result:** The output is always between **0 and 1**, representing the **probability** ($P$) that the event will happen.

### How it Decides (Decision Boundary)
Once the model calculates the probability, it applies a **threshold** (usually 0.5) to classify the data.

* If $P \ge 0.5 \rightarrow$ Class 1 (e.g., "Spam")
* If $P < 0.5 \rightarrow$ Class 0 (e.g., "Not Spam")

---

### Key Differences: Linear vs. Logistic Regression

| Feature | Linear Regression | Logistic Regression |
| :--- | :--- | :--- |
| **Purpose** | Predict continuous values (Price, Temp). | Predict categorical outcomes (Yes/No). |
| **Output** | Any number ($-\infty$ to $+\infty$). | Probability ($0$ to $1$). |
| **Curve** | Straight Line. | S-Shaped Curve (Sigmoid). |
| **Cost Function** | Mean Squared Error (MSE). | **Log Loss** (Binary Cross Entropy). |

---

### Types of Logistic Regression
1.  **Binary:** Target has two possible outcomes (e.g., Pass/Fail).
2.  **Multinomial:** Target has three or more nominal categories (e.g., Cat/Dog/Bird) â€“ uses the **Softmax** function.
3.  **Ordinal:** Target has three or more ordinal categories (e.g., Low/Medium/High).

### Assumptions
* **Linearity of Log-Odds:** The independent variables are linearly related to the log-odds of the target.
* **No Multicollinearity:** Independent variables should not be highly correlated with each other.
* **Large Sample Size:** Generally requires a larger dataset than linear regression to achieve stable results.

---


In [1]:
import numpy as np
import random

In [2]:
n, k, p=100, 8, 3 

In [3]:
X=np.random.random([n,k])
W=np.random.random([k,p])

y=np.random.randint(p, size=(1,n))
Y=np.zeros((n,p))
Y[np.arange(n), y]=1

max_itr=5000
alpha=0.01
Lambda=0.01

Gradient is as follows:
$$ X^T (H(x)-Y) + \lambda 2 W$$

In [4]:
# F(x)= w[0]*x + w[1]
def F(X, W):
    return np.matmul(X,W)

def H(F):
    return 1/(1+np.exp(-F))

def cost(Y_est, Y):
    E= - (1/n) * (np.sum(Y*np.log(Y_est) + (1-Y)*np.log(1-Y_est)))  + np.linalg.norm(W,2)
    return E, np.sum(np.argmax(Y_est,1)==y)/n

def gradient(Y_est, Y, X):
    return (1/n) * np.matmul(X.T, (Y_est - Y) ) + Lambda* 2* W

In [5]:
def fit(W, X, Y, alpha, max_itr):
    for i in range(max_itr):
        
        F_x=F(X,W)
        Y_est=H(F_x)
        E, c= cost(Y_est, Y)
        Wg=gradient(Y_est, Y, X)
        W=W - alpha * Wg
        if i%1000==0:
            print(E, c)
        
    return W, Y_est

To take into account for the biases, we concatenate X by a 1 column, and increase the number of rows in W by one

In [6]:
X=np.concatenate( (X, np.ones((n,1))), axis=1 ) 
W=np.concatenate( (W, np.random.random((1,p)) ), axis=0 )

W, Y_est = fit(W, X, Y, alpha, max_itr)

9.368653735228364 0.31
4.994251188297815 0.43
4.951873226767272 0.48
4.922370610237865 0.47
4.901694423284286 0.48


In [None]:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score

# Create model with regularization
model = LogisticRegression(
    penalty='l2',           # Regularization type
    C=1.0,                  # Inverse of regularization strength
    solver='lbfgs',         # Optimization algorithm
    max_iter=1000,          # Maximum iterations
    class_weight='balanced' # Handle imbalanced data
)

# Fit and predict
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]

# Evaluate
print(classification_report(y_test, y_pred))
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.3f}")

# Interpret coefficients
for feature, coef in zip(feature_names, model.coef_[0]):
    print(f"{feature}: {coef:.4f} (OR: {np.exp(coef):.4f})")