# Logistic Regression - Complete Guide

## From Theory to Implementation with Visualizations

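Logistic Regression is one of the most fundamental algorithms for **binary classification**. Despite its name, it is a classification algorithm: the "regression" refers to fitting a linear model for the log-odds of a class, which is then turned into a class probability.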

### What You'll Learn
1. Mathematical foundations (sigmoid, log-odds)
2. Cost function and gradient descent
3. Implementation from scratch
4. Scikit-learn implementation
5. Multiclass classification
6. Regularization (L1, L2)
7. Model evaluation and interpretation

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification, load_iris, load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (accuracy_score, confusion_matrix, classification_report,
                             roc_curve, auc, precision_recall_curve)

plt.style.use('seaborn-v0_8-whitegrid')
np.random.seed(42)

## 1. The Sigmoid Function

The sigmoid function maps any real number to a value between 0 and 1:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

This is crucial for converting linear predictions to probabilities.
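
As a minimal sketch of the formula above, the function below evaluates the sigmoid in a way that avoids overflow warnings from `np.exp` when the input is a large negative number. The name `stable_sigmoid` and the sample scores are illustrative assumptions, not part of this notebook's own code.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid that never passes a large positive argument to exp()."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) <= 1, so the usual form is safe.
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, use the equivalent form exp(z) / (1 + exp(z)).
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out

# Hypothetical linear scores, including extreme values
z_demo = np.array([-1000.0, -2.0, 0.0, 2.0, 1000.0])
p_demo = stable_sigmoid(z_demo)
print(p_demo)
```

Both branches are algebraically identical to $\sigma(z)$; only the order of operations differs. The output also reflects the symmetry $\sigma(-z) = 1 - \sigma(z)$.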

In [None]:
def sigmoid(z):
    """Sigmoid activation function"""
    return 1 / (1 + np.exp(-z))

# Visualize sigmoid function
z = np.linspace(-10, 10, 200)
sig = sigmoid(z)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Sigmoid curve
axes[0].plot(z, sig, 'b-', linewidth=2)
axes[0].axhline(y=0.5, color='r', linestyle='--', alpha=0.7, label='Decision boundary (0.5)')
axes[0].axvline(x=0, color='g', linestyle='--', alpha=0.7)
axes[0].fill_between(z, 0, sig, where=(z > 0), alpha=0.3, color='green', label='Class 1')
axes[0].fill_between(z, 0, sig, where=(z <= 0), alpha=0.3, color='red', label='Class 0')
axes[0].set_xlabel('z (linear combination)', fontsize=12)
axes[0].set_ylabel('σ(z) = P(y=1|x)', fontsize=12)
axes[0].set_title('Sigmoid Function', fontsize=14)
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Derivative of sigmoid
sig_derivative = sig * (1 - sig)
axes[1].plot(z, sig_derivative, 'purple', linewidth=2)
axes[1].set_xlabel('z', fontsize=12)
axes[1].set_ylabel("σ'(z)", fontsize=12)
axes[1].set_title('Derivative of Sigmoid: σ(z)(1-σ(z))', fontsize=14)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Key properties of sigmoid:")
print(f"σ(0) = {sigmoid(0):.3f}")
print(f"σ(large +ve) → {sigmoid(10):.6f}")
print(f"σ(large -ve) → {sigmoid(-10):.6f}")

## 2. Logistic Regression Model

The model predicts the probability of class 1:

$$P(y=1|\mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^T\mathbf{x} + b)}}$$

### Log-Odds (Logit) Interpretation

$$\log\left(\frac{P(y=1|\mathbf{x})}{P(y=0|\mathbf{x})}\right) = \mathbf{w}^T\mathbf{x} + b$$

The linear combination represents the **log-odds** of the positive class.
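
A small numeric check can make the log-odds relationship concrete. The sketch below uses hypothetical weights, bias, and input (`w_demo`, `b_demo`, `x_demo` are made up for illustration) and confirms that applying the logit to the predicted probability recovers the linear score.

```python
import numpy as np

def sigmoid_fn(z):
    """Map a linear score to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Log-odds: the inverse of the sigmoid."""
    return np.log(p / (1.0 - p))

# Hypothetical parameters and input, chosen only for illustration
w_demo = np.array([0.8, -0.5])
b_demo = 0.2
x_demo = np.array([1.5, 2.0])

score = w_demo @ x_demo + b_demo      # linear score = log-odds
prob = sigmoid_fn(score)              # P(y=1 | x)
odds = prob / (1.0 - prob)            # odds of the positive class

print(f"score = {score:.4f}, P(y=1|x) = {prob:.4f}, "
      f"odds = {odds:.4f}, logit(p) = {logit(prob):.4f}")
```

The odds equal `exp(score)`, so a score of 0 corresponds to even odds and a probability of 0.5.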

In [None]:
# Generate binary classification data
X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           n_informative=2, n_clusters_per_class=1,
                           class_sep=1.5, random_state=42)

# Visualize the data
plt.figure(figsize=(10, 6))
scatter = plt.scatter(X[:, 0], X[:, 1], c=y, cmap='RdYlBu', edgecolors='black', s=100)
plt.colorbar(scatter, label='Class')
plt.xlabel('Feature 1', fontsize=12)
plt.ylabel('Feature 2', fontsize=12)
plt.title('Binary Classification Dataset', fontsize=14)
plt.show()

## 3. Cost Function: Binary Cross-Entropy

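MSE is a poor fit for classification: combined with the sigmoid, it gives a cost that is non-convex in the parameters, so gradient descent can stall. Instead, we use **log loss** (binary cross-entropy), which is convex for logistic regression: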

$$J(\mathbf{w}, b) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(\hat{y}^{(i)}) + (1-y^{(i)})\log(1-\hat{y}^{(i)})\right]$$

This penalizes confident wrong predictions heavily.
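
The sketch below evaluates the loss for a few predicted probabilities and then checks the analytic gradient of $J$ against a finite-difference estimate. The helper names (`bce_loss`, `bce_grad`) and the synthetic data are assumptions made for this example. The gradient expression is the same one used in the from-scratch implementation in Section 4.

```python
import numpy as np

def bce_loss(w, b, X_in, y_in, eps=1e-15):
    """Mean binary cross-entropy J(w, b) for a logistic model."""
    p = 1.0 / (1.0 + np.exp(-(X_in @ w + b)))
    p = np.clip(p, eps, 1.0 - eps)          # keep log() finite
    return -np.mean(y_in * np.log(p) + (1 - y_in) * np.log(1 - p))

def bce_grad(w, b, X_in, y_in):
    """Analytic gradients: dJ/dw = X^T (p - y) / m, dJ/db = mean(p - y)."""
    p = 1.0 / (1.0 + np.exp(-(X_in @ w + b)))
    return X_in.T @ (p - y_in) / len(y_in), np.mean(p - y_in)

# Per-example loss when the true label is 1
for p_hat in (0.99, 0.5, 0.01):
    print(f"y=1, p={p_hat}: loss = {-np.log(p_hat):.3f}")

# Finite-difference check on small synthetic data (illustrative only)
rng = np.random.default_rng(0)
X_chk = rng.normal(size=(50, 3))
y_chk = (rng.random(50) < 0.5).astype(float)
w_chk, b_chk = rng.normal(size=3), 0.1

dw, db = bce_grad(w_chk, b_chk, X_chk, y_chk)
h = 1e-6
dw_num = np.array([
    (bce_loss(w_chk + h * e, b_chk, X_chk, y_chk)
     - bce_loss(w_chk - h * e, b_chk, X_chk, y_chk)) / (2 * h)
    for e in np.eye(3)
])
print("max |analytic - numeric| =", np.max(np.abs(dw - dw_num)))
```

The loss for a true positive grows from about 0.01 at $p = 0.99$ to about 4.6 at $p = 0.01$, which is the heavy penalty on confident mistakes described above.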

In [None]:
# Visualize the cost function behavior
p = np.linspace(0.001, 0.999, 100)

# Cost when y=1: -log(p)
cost_y1 = -np.log(p)
# Cost when y=0: -log(1-p)
cost_y0 = -np.log(1 - p)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].plot(p, cost_y1, 'b-', linewidth=2, label='y=1: -log(p)')
axes[0].set_xlabel('Predicted Probability p', fontsize=12)
axes[0].set_ylabel('Cost', fontsize=12)
axes[0].set_title('Cost when Actual Class = 1', fontsize=14)
axes[0].legend()
axes[0].grid(True, alpha=0.3)
axes[0].annotate('Low cost when p→1', xy=(0.9, 0.1), fontsize=10)
axes[0].annotate('High cost when p→0', xy=(0.1, 2), fontsize=10)

axes[1].plot(p, cost_y0, 'r-', linewidth=2, label='y=0: -log(1-p)')
axes[1].set_xlabel('Predicted Probability p', fontsize=12)
axes[1].set_ylabel('Cost', fontsize=12)
axes[1].set_title('Cost when Actual Class = 0', fontsize=14)
axes[1].legend()
axes[1].grid(True, alpha=0.3)
axes[1].annotate('Low cost when p→0', xy=(0.1, 0.1), fontsize=10)
axes[1].annotate('High cost when p→1', xy=(0.8, 2), fontsize=10)

plt.tight_layout()
plt.show()

## 4. Implementation from Scratch

In [None]:
class LogisticRegressionScratch:
    """Logistic Regression implementation from scratch"""
    
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.weights = None
        self.bias = None
        self.cost_history = []
        
    def fit(self, X, y):
        """Train the model using gradient descent"""
        n_samples, n_features = X.shape
        
        # Initialize parameters
        self.weights = np.zeros(n_features)
        self.bias = 0
        
        # Gradient descent
        for i in range(self.n_iterations):
            # Forward pass
            linear_pred = np.dot(X, self.weights) + self.bias
            predictions = sigmoid(linear_pred)
            
            # Compute cost
            cost = (-1/n_samples) * np.sum(
                y * np.log(predictions + 1e-15) + 
                (1 - y) * np.log(1 - predictions + 1e-15)
            )
            self.cost_history.append(cost)
            
            # Compute gradients
            dw = (1/n_samples) * np.dot(X.T, (predictions - y))
            db = (1/n_samples) * np.sum(predictions - y)
            
            # Update parameters
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db
            
        return self
    
    def predict_proba(self, X):
        """Predict probabilities"""
        linear_pred = np.dot(X, self.weights) + self.bias
        return sigmoid(linear_pred)
    
    def predict(self, X, threshold=0.5):
        """Predict class labels"""
        return (self.predict_proba(X) >= threshold).astype(int)

# Train the model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train our model
model_scratch = LogisticRegressionScratch(learning_rate=0.1, n_iterations=1000)
model_scratch.fit(X_train_scaled, y_train)

# Predictions
y_pred = model_scratch.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy (from scratch): {accuracy:.4f}")
print(f"Weights: {model_scratch.weights}")
print(f"Bias: {model_scratch.bias:.4f}")

In [None]:
# Visualize training progress and decision boundary
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Cost history
axes[0].plot(model_scratch.cost_history, 'b-', linewidth=2)
axes[0].set_xlabel('Iteration', fontsize=12)
axes[0].set_ylabel('Cost', fontsize=12)
axes[0].set_title('Training Progress: Cost vs Iterations', fontsize=14)
axes[0].grid(True, alpha=0.3)

# Decision boundary
h = 0.02
x_min, x_max = X_train_scaled[:, 0].min() - 1, X_train_scaled[:, 0].max() + 1
y_min, y_max = X_train_scaled[:, 1].min() - 1, X_train_scaled[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

Z = model_scratch.predict_proba(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

axes[1].contourf(xx, yy, Z, levels=50, cmap='RdYlBu', alpha=0.8)
axes[1].scatter(X_train_scaled[:, 0], X_train_scaled[:, 1], c=y_train, 
                cmap='RdYlBu', edgecolors='black', s=100)
axes[1].contour(xx, yy, Z, levels=[0.5], colors='black', linewidths=2)
axes[1].set_xlabel('Feature 1 (scaled)', fontsize=12)
axes[1].set_ylabel('Feature 2 (scaled)', fontsize=12)
axes[1].set_title('Decision Boundary with Probability Contours', fontsize=14)

plt.tight_layout()
plt.show()

## 5. Scikit-learn Implementation

In [None]:
# Using sklearn's LogisticRegression
model_sklearn = LogisticRegression(max_iter=1000, random_state=42)
model_sklearn.fit(X_train_scaled, y_train)

y_pred_sklearn = model_sklearn.predict(X_test_scaled)
y_prob_sklearn = model_sklearn.predict_proba(X_test_scaled)[:, 1]

print(f"Accuracy (sklearn): {accuracy_score(y_test, y_pred_sklearn):.4f}")
print(f"\nCoefficients: {model_sklearn.coef_[0]}")
print(f"Intercept: {model_sklearn.intercept_[0]:.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_sklearn))

## 6. Model Evaluation: ROC Curve and Precision-Recall

In [None]:
# Confusion matrix, ROC curve, and precision-recall curve
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred_sklearn)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0],
            xticklabels=['Predicted 0', 'Predicted 1'],
            yticklabels=['Actual 0', 'Actual 1'])
axes[0].set_title('Confusion Matrix', fontsize=14)

# ROC Curve
fpr, tpr, thresholds_roc = roc_curve(y_test, y_prob_sklearn)
roc_auc = auc(fpr, tpr)

axes[1].plot(fpr, tpr, 'b-', linewidth=2, label=f'ROC Curve (AUC = {roc_auc:.3f})')
axes[1].plot([0, 1], [0, 1], 'r--', linewidth=1, label='Random Classifier')
axes[1].fill_between(fpr, tpr, alpha=0.3)
axes[1].set_xlabel('False Positive Rate', fontsize=12)
axes[1].set_ylabel('True Positive Rate', fontsize=12)
axes[1].set_title('ROC Curve', fontsize=14)
axes[1].legend(loc='lower right')
axes[1].grid(True, alpha=0.3)

# Precision-Recall Curve
precision, recall, thresholds_pr = precision_recall_curve(y_test, y_prob_sklearn)

axes[2].plot(recall, precision, 'g-', linewidth=2)
axes[2].fill_between(recall, precision, alpha=0.3, color='green')
axes[2].set_xlabel('Recall', fontsize=12)
axes[2].set_ylabel('Precision', fontsize=12)
axes[2].set_title('Precision-Recall Curve', fontsize=14)
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 7. Multiclass Classification

Logistic Regression can be extended to multiple classes using:
- **One-vs-Rest (OvR)**: Train K binary classifiers
- **Multinomial (Softmax)**: Direct multiclass probability
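
To illustrate the multinomial option, here is a minimal softmax sketch. The scores are hypothetical, and the `softmax` helper is written for this example rather than taken from scikit-learn.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; subtracting the row max keeps exp() from overflowing."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

# Hypothetical linear scores for 2 samples and K = 3 classes
scores_demo = np.array([[2.0, 1.0, 0.1],
                        [0.5, 0.5, 3.0]])
probs_demo = softmax(scores_demo)
print(probs_demo)
print("row sums:", probs_demo.sum(axis=1))
print("predicted classes:", probs_demo.argmax(axis=1))

# With K = 2 and scores [0, z], softmax reduces to the sigmoid of z
z_val = 1.3
two_class = softmax(np.array([[0.0, z_val]]))[0, 1]
print(two_class, 1.0 / (1.0 + np.exp(-z_val)))
```

Each row sums to 1, so the outputs can be read directly as class probabilities. The two-class case shows why binary logistic regression is a special case of the multinomial model.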

In [None]:
# Load Iris dataset (3 classes)
iris = load_iris()
X_iris, y_iris = iris.data, iris.target
feature_names = iris.feature_names
class_names = iris.target_names

# Split and scale
X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=42
)

scaler_iris = StandardScaler()
X_train_iris_scaled = scaler_iris.fit_transform(X_train_iris)
X_test_iris_scaled = scaler_iris.transform(X_test_iris)

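# Train multinomial (softmax) logistic regression.
# The multi_class argument is deprecated in recent scikit-learn releases; the default
# lbfgs solver already fits a multinomial model when there are more than two classes.
model_multi = LogisticRegression(max_iter=1000, random_state=42)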
model_multi.fit(X_train_iris_scaled, y_train_iris)

# Predictions
y_pred_iris = model_multi.predict(X_test_iris_scaled)
y_prob_iris = model_multi.predict_proba(X_test_iris_scaled)

print(f"Accuracy: {accuracy_score(y_test_iris, y_pred_iris):.4f}")
print("\nClassification Report:")
print(classification_report(y_test_iris, y_pred_iris, target_names=class_names))

In [None]:
# Visualize multiclass results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Confusion matrix
cm_iris = confusion_matrix(y_test_iris, y_pred_iris)
sns.heatmap(cm_iris, annot=True, fmt='d', cmap='Blues', ax=axes[0],
            xticklabels=class_names, yticklabels=class_names)
axes[0].set_xlabel('Predicted', fontsize=12)
axes[0].set_ylabel('Actual', fontsize=12)
axes[0].set_title('Multiclass Confusion Matrix', fontsize=14)

# Feature importance (coefficients)
coef_df = pd.DataFrame(model_multi.coef_, columns=feature_names, index=class_names)
coef_df.T.plot(kind='bar', ax=axes[1], colormap='viridis')
axes[1].set_xlabel('Features', fontsize=12)
axes[1].set_ylabel('Coefficient Value', fontsize=12)
axes[1].set_title('Feature Coefficients by Class', fontsize=14)
axes[1].legend(title='Class')
axes[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

## 8. Regularization: L1 (Lasso) and L2 (Ridge)

Regularization reduces overfitting by adding a penalty on the weights to the cost function:

- **L2 (Ridge)**: $J(\mathbf{w}) + \lambda\sum w_j^2$ → shrinks coefficients
- **L1 (Lasso)**: $J(\mathbf{w}) + \lambda\sum |w_j|$ → sparse coefficients (feature selection)
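
One way to see why the two penalties behave differently is to apply a single penalty-only update to a weight vector. The sketch below uses the proximal form of each penalty, which is an assumption of this example and not necessarily how scikit-learn's solvers work internally. The weights and step size are hypothetical.

```python
import numpy as np

def l2_shrink(w, t):
    """Proximal step for t * sum(w_j^2): scales every weight toward zero."""
    return w / (1.0 + 2.0 * t)

def l1_soft_threshold(w, t):
    """Proximal step for t * sum(|w_j|): weights with |w_j| <= t become exactly zero."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

w_demo = np.array([2.0, -0.3, 0.05, -1.2])   # hypothetical weights
t_demo = 0.5                                 # hypothetical step: learning rate * lambda

w_l2 = l2_shrink(w_demo, t_demo)
w_l1 = l1_soft_threshold(w_demo, t_demo)
print("L2:", w_l2)
print("L1:", w_l1)
print("non-zero after L2:", np.count_nonzero(w_l2))
print("non-zero after L1:", np.count_nonzero(w_l1))
```

The L2 step rescales all four weights but leaves them non-zero, while the L1 step sets the two small weights to exactly zero. That is the mechanism behind the sparsity noted above.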

In [None]:
# Load breast cancer dataset (high-dimensional)
cancer = load_breast_cancer()
X_cancer, y_cancer = cancer.data, cancer.target

X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X_cancer, y_cancer, test_size=0.3, random_state=42
)

scaler_c = StandardScaler()
X_train_c_scaled = scaler_c.fit_transform(X_train_c)
X_test_c_scaled = scaler_c.transform(X_test_c)

# Compare different regularization strengths
C_values = [0.001, 0.01, 0.1, 1, 10, 100]  # C = 1/λ

results = {'C': [], 'L1_train': [], 'L1_test': [], 'L2_train': [], 'L2_test': [],
           'L1_nonzero': [], 'L2_nonzero': []}

for C in C_values:
    # L1 regularization
    model_l1 = LogisticRegression(penalty='l1', C=C, solver='saga', max_iter=5000)
    model_l1.fit(X_train_c_scaled, y_train_c)
    
    # L2 regularization
    model_l2 = LogisticRegression(penalty='l2', C=C, max_iter=5000)
    model_l2.fit(X_train_c_scaled, y_train_c)
    
    results['C'].append(C)
    results['L1_train'].append(model_l1.score(X_train_c_scaled, y_train_c))
    results['L1_test'].append(model_l1.score(X_test_c_scaled, y_test_c))
    results['L2_train'].append(model_l2.score(X_train_c_scaled, y_train_c))
    results['L2_test'].append(model_l2.score(X_test_c_scaled, y_test_c))
    results['L1_nonzero'].append(np.sum(model_l1.coef_ != 0))
    results['L2_nonzero'].append(np.sum(model_l2.coef_ != 0))

results_df = pd.DataFrame(results)
print(results_df.to_string(index=False))

In [None]:
# Visualize regularization effects
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Accuracy vs C
axes[0].semilogx(results['C'], results['L1_test'], 'b-o', label='L1 Test', linewidth=2)
axes[0].semilogx(results['C'], results['L2_test'], 'r-s', label='L2 Test', linewidth=2)
axes[0].semilogx(results['C'], results['L1_train'], 'b--', alpha=0.5, label='L1 Train')
axes[0].semilogx(results['C'], results['L2_train'], 'r--', alpha=0.5, label='L2 Train')
axes[0].set_xlabel('C (Inverse Regularization Strength)', fontsize=12)
axes[0].set_ylabel('Accuracy', fontsize=12)
axes[0].set_title('Regularization Effect on Accuracy', fontsize=14)
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Number of non-zero coefficients
axes[1].semilogx(results['C'], results['L1_nonzero'], 'b-o', label='L1', linewidth=2)
axes[1].semilogx(results['C'], results['L2_nonzero'], 'r-s', label='L2', linewidth=2)
axes[1].axhline(y=X_train_c_scaled.shape[1], color='gray', linestyle='--', label='Total features')
axes[1].set_xlabel('C (Inverse Regularization Strength)', fontsize=12)
axes[1].set_ylabel('Number of Non-Zero Coefficients', fontsize=12)
axes[1].set_title('Feature Sparsity: L1 vs L2', fontsize=14)
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 9. Feature Importance and Interpretation

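Coefficients in logistic regression have a clear interpretation:
- **Sign**: Direction of the effect on the probability of class 1
- **Magnitude**: Strength of the effect (comparable across features only when they share a scale, e.g. after standardization)
- **Exp(coef)**: Odds ratio, the factor by which the odds of class 1 change per one-unit increase in the feature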
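
The odds-ratio reading of `exp(coef)` can be verified numerically. In this sketch the parameters `w_fit` and `b_fit` and the input `x_base` are hypothetical values, not results from the models in this notebook.

```python
import numpy as np

def predicted_odds(x, w, b):
    """Odds P(y=1|x) / P(y=0|x) under a logistic model."""
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))
    return p / (1.0 - p)

# Hypothetical fitted parameters for two standardized features
w_fit = np.array([0.7, -1.1])
b_fit = 0.3

x_base = np.array([0.2, -0.4])
# Raise feature 0 by one unit (one standard deviation if the feature is standardized)
x_plus = x_base + np.array([1.0, 0.0])

ratio = predicted_odds(x_plus, w_fit, b_fit) / predicted_odds(x_base, w_fit, b_fit)
print(f"observed odds ratio: {ratio:.4f}")
print(f"exp(coef):           {np.exp(w_fit[0]):.4f}")
```

The two printed values agree regardless of the starting point `x_base`, because a one-unit change in a feature always adds its coefficient to the log-odds.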

In [None]:
# Train the final model with C=1.0 (scikit-learn's default; tune with cross-validation for a true optimum)
best_model = LogisticRegression(penalty='l2', C=1.0, max_iter=5000)
best_model.fit(X_train_c_scaled, y_train_c)

# Feature importance
feature_importance = pd.DataFrame({
    'Feature': cancer.feature_names,
    'Coefficient': best_model.coef_[0],
    'Abs_Coefficient': np.abs(best_model.coef_[0]),
    'Odds_Ratio': np.exp(best_model.coef_[0])
}).sort_values('Abs_Coefficient', ascending=False)

print("Top 10 Most Important Features:")
print(feature_importance.head(10).to_string(index=False))

In [None]:
# Visualize feature importance
top_features = feature_importance.head(15)

plt.figure(figsize=(12, 8))
colors = ['green' if x > 0 else 'red' for x in top_features['Coefficient']]
plt.barh(top_features['Feature'], top_features['Coefficient'], color=colors)
plt.xlabel('Coefficient Value', fontsize=12)
plt.ylabel('Feature', fontsize=12)
# In load_breast_cancer, target 1 = benign and 0 = malignant, so positive coefficients favor benign
plt.title('Top 15 Feature Coefficients\n(Green = increases benign probability, Red = decreases)', fontsize=14)
plt.axvline(x=0, color='black', linewidth=0.5)
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

## 10. Threshold Tuning

The default threshold is 0.5, but you can adjust it based on the cost of false positives vs false negatives.
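
One way to make that trade-off explicit is to assign a cost to each error type and pick the threshold with the lowest average cost. The sketch below assumes hypothetical costs (a false negative counted as 5 times a false positive) and synthetic scores; `expected_cost` is a helper written for this example.

```python
import numpy as np

def expected_cost(y_true, y_prob, threshold, cost_fp, cost_fn):
    """Average misclassification cost at a given threshold."""
    y_hat = (y_prob >= threshold).astype(int)
    fp = np.sum((y_hat == 1) & (y_true == 0))
    fn = np.sum((y_hat == 0) & (y_true == 1))
    return (cost_fp * fp + cost_fn * fn) / len(y_true)

# Synthetic labels and scores, for illustration only
rng = np.random.default_rng(0)
y_true_demo = rng.integers(0, 2, size=500)
noise = rng.normal(0, 0.2, size=500)
y_prob_demo = np.clip(0.5 + 0.25 * (2 * y_true_demo - 1) + noise, 0, 1)

threshold_grid = np.linspace(0.05, 0.95, 19)
# Hypothetical costs: a missed positive is 5x as costly as a false alarm
costs = [expected_cost(y_true_demo, y_prob_demo, thr, cost_fp=1.0, cost_fn=5.0)
         for thr in threshold_grid]
best_threshold = threshold_grid[int(np.argmin(costs))]
print(f"lowest-cost threshold: {best_threshold:.2f}")
```

When false negatives are the more expensive error, the lowest-cost threshold tends to fall below 0.5, since lowering the threshold trades extra false positives for fewer missed positives.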

In [None]:
# Analyze different thresholds
y_prob_cancer = best_model.predict_proba(X_test_c_scaled)[:, 1]

thresholds = np.arange(0.1, 0.9, 0.05)
metrics = {'threshold': [], 'accuracy': [], 'precision': [], 'recall': [], 'f1': []}

# Import once, outside the loop
from sklearn.metrics import precision_score, recall_score, f1_score

for thresh in thresholds:
    y_pred_thresh = (y_prob_cancer >= thresh).astype(int)
    
    metrics['threshold'].append(thresh)
    metrics['accuracy'].append(accuracy_score(y_test_c, y_pred_thresh))
    metrics['precision'].append(precision_score(y_test_c, y_pred_thresh))
    metrics['recall'].append(recall_score(y_test_c, y_pred_thresh))
    metrics['f1'].append(f1_score(y_test_c, y_pred_thresh))

plt.figure(figsize=(12, 6))
plt.plot(metrics['threshold'], metrics['accuracy'], 'b-o', label='Accuracy', linewidth=2)
plt.plot(metrics['threshold'], metrics['precision'], 'g-s', label='Precision', linewidth=2)
plt.plot(metrics['threshold'], metrics['recall'], 'r-^', label='Recall', linewidth=2)
plt.plot(metrics['threshold'], metrics['f1'], 'purple', marker='d', label='F1 Score', linewidth=2)
plt.axvline(x=0.5, color='gray', linestyle='--', label='Default threshold')
plt.xlabel('Classification Threshold', fontsize=12)
plt.ylabel('Score', fontsize=12)
plt.title('Metrics vs Classification Threshold', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

## Summary

### Key Takeaways

1. **Logistic Regression** uses the sigmoid function to model probabilities
2. **Binary Cross-Entropy** is the appropriate loss function
3. **Gradient Descent** optimizes the parameters
4. **Regularization** (L1/L2) reduces overfitting; L1 can also perform feature selection by driving coefficients to zero
5. **Multiclass** extension uses softmax (multinomial) or OvR
6. **Interpretation**: Each coefficient is the change in log-odds per one-unit increase in its feature

### When to Use Logistic Regression

**Use when:**
- You need a fast, interpretable model
- Classes are roughly linearly separable in the feature space
- You need probability outputs
- Feature importance is needed

**Avoid when:**
- Complex non-linear relationships exist
- Very high-dimensional sparse data, unless you add strong regularization or use solvers designed for sparse input

### Practice Problems

1. Implement mini-batch gradient descent for logistic regression
2. Add L2 regularization to the scratch implementation
3. Compare OvR vs Multinomial on a 5-class problem
4. Find the optimal threshold for a medical diagnosis problem