## logistic regression 

 logistic regression multi-dimensional data
 
 
 $$ F(X)=X \times W $$
 $$ H(x)= \frac{1}{1+ e ^{-F(x)}} $$
 $$ C= -\frac{1}{n} \sum_{i,j} (Y \odot log(H(x)) + (1-Y) \odot log(1-H(x)) ) $$

$X_{n \times k}$

$W_{k \times p}$

$Y_{n \times p}$

Logistic regression is a statistical model used for binary classification (can be extended to multi-class). It predicts the probability that an instance belongs to a particular class.

Despite its name, it's a classification algorithm, not regression. The "regression" part refers to using a regression-like approach to estimate probabilities.

### Logit Function (Log-Odds)
``` logit(p) = ln(p/(1-p)) = β₀ + β₁x₁ + ... + βₙxₙ  ```

Where:

p = probability of success (class = 1)

1-p = probability of failure (class = 0)

ln(p/(1-p)) = log-odds


### Sigmoid Function
```bash
p = σ(z) = 1 / (1 + e^(-z))
where z = β₀ + β₁x₁ + ... + βₙxₙ

```

Properties of Sigmoid:

Output range: (0, 1)

S-shaped curve

Derivative: σ'(z) = σ(z)(1-σ(z))

### Assumptions of Logistic Regression

- Binary outcome variable (for standard logistic regression)

- Linearity of independent variables and log odds (logit linearity)

- No multicollinearity among independent variables

- Independence of observations

- Large sample size (rule of thumb: at least 10 events per predictor variable)

###  Cost Function: Cross-Entropy Loss
``` J(θ) = -1/m ∑ [y⁽ⁱ⁾ log(hθ(x⁽ⁱ⁾)) + (1-y⁽ⁱ⁾) log(1-hθ(x⁽ⁱ⁾))]  ```


Where:

hθ(x) = predicted probability

y = actual label (0 or 1)

m = number of samples

Why not MSE? MSE would give non-convex loss surface, making optimization difficult.

### Model Training: Optimization

Maximum Likelihood Estimation (MLE)
Find parameters that maximize likelihood of observing the data:
``` L(θ) = ∏ hθ(x⁽ⁱ⁾)^y⁽ⁱ⁾ (1-hθ(x⁽ⁱ⁾))^(1-y⁽ⁱ⁾)  ```
Maximizing likelihood = minimizing negative log-likelihood (cross-entropy)

### Optimization Algorithms
  Gradient Descent
  ```bash  
   θ_j := θ_j - α ∂J(θ)/∂θ_j
   ∂J(θ)/∂θ_j = 1/m ∑ (hθ(x⁽ⁱ⁾) - y⁽ⁱ⁾) x_j⁽ⁱ⁾
   ```


### Model Evaluation Metrics

Classification Metrics:
Accuracy: (TP+TN)/(TP+TN+FP+FN)

Precision: TP/(TP+FP)

Recall/Sensitivity: TP/(TP+FN)

F1-Score: 2(PrecisionRecall)/(Precision+Recall)

Specificity: TN/(TN+FP)

Probability Calibration Metrics:
Log Loss: Measures quality of predicted probabilities

ROC-AUC: Area under ROC curve (measures separability)

Precision-Recall AUC: Better for imbalanced data


### Threshold Selection
Default threshold = 0.5, but can be tuned:

Use ROC curve to find optimal threshold
Consider business costs of FP vs FN
Use Youden's J statistic: J = Sensitivity + Specificity - 1

### Regularization in Logistic Regression

L1 Regularization (Lasso):
``` J(θ) = CrossEntropy + λ∑|θ_j| ```
Creates sparse models (feature selection)

Can zero out coefficients

L2 Regularization (Ridge):
``` J(θ) = CrossEntropy + λ∑θ_j² ```

Shrinks coefficients toward zero

Handles multicollinearity better

### Elastic Net: Combination of L1 and L2


## Multiclass Logistic Regression
Two Approaches:
One-vs-Rest (OvR): Train k binary classifiers

Multinomial/Softmax Regression: Direct generalization
``` P(y=k|x) = e^(θ_k·x) / ∑ e^(θ_j·x)  ```

##  Feature Importance & Interpretation
  ### Odds Ratio
  ``` OR = e^β ```
  Interpretation: For 1-unit increase in x, odds increase by factor of e^β

### Statistical Significance
Wald Test: z = β/SE(β) ~ N(0,1)

p-values: Test if coefficient ≠ 0

Confidence Intervals: CI for odds ratios

- Handling Categorical Variables
Use one-hot encoding (dummy variables)

Avoid dummy variable trap (drop one category)

Consider target/mean encoding for high-cardinality features

- Feature Scaling
Not required for logistic regression (unlike k-NN or SVM)

But helps gradient descent converge faster

Required for regularization to work properly

- Checking Model Assumptions
Linearity: Box-Tidwell test or residual plots

Influential points: Cook's distance

Multicollinearity: VIF (Variance Inflation Factor)

##### Advantages:
Outputs probabilities

Interpretable (odds ratios)

Efficient to train

Works well with small datasets

Less prone to overfitting with regularization

##### Disadvantages:
Assumes linear decision boundary

Sensitive to outliers

Requires feature engineering for non-linear relationships

Can underperform with complex patterns








In [1]:
import numpy as np
import random

In [2]:
n, k, p=100, 8, 3 

In [3]:
X=np.random.random([n,k])
W=np.random.random([k,p])

y=np.random.randint(p, size=(1,n))
Y=np.zeros((n,p))
Y[np.arange(n), y]=1

max_itr=5000
alpha=0.01
Lambda=0.01

Gradient is as follows:
$$ X^T (H(x)-Y) + \lambda 2 W$$

In [4]:
# F(x)= w[0]*x + w[1]
def F(X, W):
    return np.matmul(X,W)

def H(F):
    return 1/(1+np.exp(-F))

def cost(Y_est, Y):
    E= - (1/n) * (np.sum(Y*np.log(Y_est) + (1-Y)*np.log(1-Y_est)))  + np.linalg.norm(W,2)
    return E, np.sum(np.argmax(Y_est,1)==y)/n

def gradient(Y_est, Y, X):
    return (1/n) * np.matmul(X.T, (Y_est - Y) ) + Lambda* 2* W

In [5]:
def fit(W, X, Y, alpha, max_itr):
    for i in range(max_itr):
        
        F_x=F(X,W)
        Y_est=H(F_x)
        E, c= cost(Y_est, Y)
        Wg=gradient(Y_est, Y, X)
        W=W - alpha * Wg
        if i%1000==0:
            print(E, c)
        
    return W, Y_est

To take into account for the biases, we concatenate X by a 1 column, and increase the number of rows in W by one

In [6]:
X=np.concatenate( (X, np.ones((n,1))), axis=1 ) 
W=np.concatenate( (W, np.random.random((1,p)) ), axis=0 )

W, Y_est = fit(W, X, Y, alpha, max_itr)

9.368653735228364 0.31
4.994251188297815 0.43
4.951873226767272 0.48
4.922370610237865 0.47
4.901694423284286 0.48


In [None]:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score

# Create model with regularization
model = LogisticRegression(
    penalty='l2',           # Regularization type
    C=1.0,                  # Inverse of regularization strength
    solver='lbfgs',         # Optimization algorithm
    max_iter=1000,          # Maximum iterations
    class_weight='balanced' # Handle imbalanced data
)

# Fit and predict
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]

# Evaluate
print(classification_report(y_test, y_pred))
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.3f}")

# Interpret coefficients
for feature, coef in zip(feature_names, model.coef_[0]):
    print(f"{feature}: {coef:.4f} (OR: {np.exp(coef):.4f})")