# Module 04: Logistic Regression

**Difficulty**: ⭐⭐ Intermediate  
**Estimated Time**: 90 minutes  
**Prerequisites**: 
- [Module 03: Linear Regression](03_linear_regression.ipynb)
- Understanding of probability and odds
- Basic calculus (derivatives)

## Learning Objectives

By the end of this notebook, you will be able to:

1. Understand logistic regression for binary classification
2. Explain the sigmoid function and its properties
3. Interpret odds, log-odds, and probabilities
4. Visualize and understand decision boundaries
5. Implement multi-class classification (OvR and OvO)
6. Evaluate classification models using accuracy and confusion matrices
7. Apply logistic regression to real-world datasets

## 1. Setup and Imports

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.special import expit

# Scikit-learn
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, confusion_matrix, classification_report,
    roc_curve, roc_auc_score, precision_recall_curve
)

# Configuration
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
np.random.seed(42)

print("All libraries imported successfully!")

## 2. From Linear to Logistic Regression

### Why Not Use Linear Regression for Classification?

**Problem**: Linear regression predicts continuous values, but classification needs discrete categories (0 or 1).

```
Linear Regression: y = β₀ + β₁x₁ + β₂x₂ + ...
Output: Any real number (-∞ to +∞)
```

**What we need**: Probabilities (0 to 1) that can be converted to class labels.

### Solution: The Sigmoid Function

**Logistic Regression** uses the **sigmoid function** to map any real number to (0, 1):

```
σ(z) = 1 / (1 + e^(-z))
```

Where z = β₀ + β₁x₁ + β₂x₂ + ... (same as linear regression)

**Properties**:
- Output always between 0 and 1
- S-shaped curve
- σ(0) = 0.5 (midpoint)
- σ(+∞) = 1
- σ(-∞) = 0
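
These properties can be checked numerically. The sketch below is a plain-Python illustration, not how `scipy.special.expit` is implemented; the helper name `stable_sigmoid` is our own. It branches on the sign of `z` so that `exp` is never called with a large positive argument.

```python
import math

def stable_sigmoid(z: float) -> float:
    """Logistic function written to avoid overflow (illustrative helper)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form e^z / (1 + e^z)
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Midpoint: sigma(0) = 0.5
print(stable_sigmoid(0.0))

# Symmetry: sigma(z) + sigma(-z) = 1
for z in (0.5, 2.0, 5.0):
    print(z, stable_sigmoid(z) + stable_sigmoid(-z))

# Saturation: in floating point the output rounds to exactly 0.0 and 1.0
# for extreme inputs, even though the mathematical function never reaches them
print(stable_sigmoid(-800.0), stable_sigmoid(800.0))
```

A naive `1 / (1 + math.exp(-z))` would raise `OverflowError` for `z = -800`, which is why the negative branch is rewritten.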

In [None]:
# Visualize the sigmoid function
z = np.linspace(-10, 10, 200)
sigmoid = expit(z)  # 1 / (1 + np.exp(-z))

plt.figure(figsize=(10, 6))
plt.plot(z, sigmoid, 'b-', linewidth=2, label='Sigmoid: σ(z) = 1/(1+e^-z)')
plt.axhline(y=0.5, color='r', linestyle='--', alpha=0.7, label='Decision threshold')
plt.axvline(x=0, color='g', linestyle='--', alpha=0.7)
plt.xlabel('z (linear combination of features)', fontsize=12)
plt.ylabel('Probability P(y=1)', fontsize=12)
plt.title('Sigmoid Function: Mapping Real Numbers to Probabilities', fontsize=14)
plt.grid(True, alpha=0.3)
plt.legend(fontsize=11)
plt.ylim(-0.1, 1.1)
plt.tight_layout()
plt.show()

print("Key Points:")
print(f"  σ(-10) = {expit(-10):.6f} ≈ 0")
print(f"  σ(-2)  = {expit(-2):.6f}")
print(f"  σ(0)   = {expit(0):.6f} = 0.5")
print(f"  σ(2)   = {expit(2):.6f}")
print(f"  σ(10)  = {expit(10):.6f} ≈ 1")
print("\nIf P(y=1) ≥ 0.5 → predict class 1")
print("If P(y=1) < 0.5 → predict class 0")

## 3. Understanding Odds and Log-Odds

### Probability vs Odds

**Probability**: P(event) = successes / total attempts
- Range: [0, 1]
- Example: P(rain) = 0.7 = 70%

**Odds**: Odds(event) = P(event) / P(not event)
- Range: [0, ∞)
- Example: Odds(rain) = 0.7 / 0.3 = 2.33 ("7 to 3")

### Log-Odds (Logit)

**Log-Odds** = log(Odds) = log(P / (1-P))
- Range: (-∞, +∞)
- This is what logistic regression actually models!

```
log(P/(1-P)) = β₀ + β₁x₁ + β₂x₂ + ...
```

**Interpretation**: Holding the other features fixed, a one-unit increase in x adds β to the log-odds, which multiplies the odds by e^β.
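
A small worked example may make this concrete. The sketch below assumes a hypothetical one-feature model (`beta0` and `beta1` are made-up values, and `logit` / `inv_logit` are illustrative helper names) and shows that each unit step in x multiplies the odds by the same factor.

```python
import math

def logit(p: float) -> float:
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def inv_logit(z: float) -> float:
    """Inverse of logit: maps log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical model: log-odds = beta0 + beta1 * x
beta0, beta1 = -1.0, 0.7

for x in (0.0, 1.0, 2.0):
    log_odds = beta0 + beta1 * x
    odds = math.exp(log_odds)
    prob = inv_logit(log_odds)
    print(f"x={x:.0f}  log-odds={log_odds:+.2f}  odds={odds:.3f}  P={prob:.3f}")

# Ratio of odds at x=1 to odds at x=0 equals exp(beta1)
odds_ratio = math.exp(beta0 + beta1 * 1.0) / math.exp(beta0 + beta1 * 0.0)
print(f"odds ratio per unit of x: {odds_ratio:.4f}")
print(f"exp(beta1):               {math.exp(beta1):.4f}")
```

Note that the probability does not change by a constant amount per step; only the log-odds (additively) and the odds (multiplicatively) do.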

In [None]:
# Demonstrate probability ↔ odds ↔ log-odds conversions
probabilities = np.array([0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99])
odds = probabilities / (1 - probabilities)
log_odds = np.log(odds)

conversion_df = pd.DataFrame({
    'Probability': probabilities,
    'Odds': odds,
    'Log-Odds': log_odds
})

print("Probability ↔ Odds ↔ Log-Odds Conversion Table:")
print(conversion_df.to_string(index=False))

print("\nKey Insights:")
print("  P = 0.5 → Odds = 1.0 (50-50 chance) → Log-Odds = 0")
print("  P > 0.5 → Odds > 1.0 → Log-Odds > 0")
print("  P < 0.5 → Odds < 1.0 → Log-Odds < 0")

## 4. Binary Classification Example

Let's build a binary classifier to predict if a tumor is malignant or benign.

In [None]:
# Load breast cancer dataset
cancer = datasets.load_breast_cancer()
X = cancer.data
y = cancer.target  # 0 = malignant, 1 = benign

# Create DataFrame for exploration
cancer_df = pd.DataFrame(X, columns=cancer.feature_names)
cancer_df['diagnosis'] = y
cancer_df['diagnosis_name'] = cancer_df['diagnosis'].map({0: 'malignant', 1: 'benign'})

print("Breast Cancer Dataset:")
print(f"Shape: {cancer_df.shape}")
print(f"\nFeatures (first 10): {list(cancer.feature_names[:10])}")
print(f"\nClass distribution:")
print(cancer_df['diagnosis_name'].value_counts())
print(f"\nFirst few rows:")
print(cancer_df.head())

In [None]:
# Visualize two features
plt.figure(figsize=(10, 6))
for diagnosis, name in [(0, 'Malignant'), (1, 'Benign')]:
    mask = y == diagnosis
    plt.scatter(
        X[mask, 0], X[mask, 1],
        label=name, s=50, alpha=0.6, edgecolors='k'
    )

plt.xlabel(cancer.feature_names[0], fontsize=12)
plt.ylabel(cancer.feature_names[1], fontsize=12)
plt.title('Tumor Classification: Mean Radius vs Mean Texture', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("Notice some separation between classes - good for classification!")

In [None]:
# Split and scale data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
print(f"\nClass distribution in training:")
print(pd.Series(y_train).value_counts())

In [None]:
# Train logistic regression model
log_reg = LogisticRegression(max_iter=10000, random_state=42)
log_reg.fit(X_train_scaled, y_train)

# Make predictions
y_train_pred = log_reg.predict(X_train_scaled)
y_test_pred = log_reg.predict(X_test_scaled)

# Get probability predictions
y_test_proba = log_reg.predict_proba(X_test_scaled)[:, 1]  # Probability of class 1

# Evaluate
train_acc = accuracy_score(y_train, y_train_pred)
test_acc = accuracy_score(y_test, y_test_pred)

print("Logistic Regression Results:")
print(f"Training Accuracy: {train_acc:.2%}")
print(f"Test Accuracy: {test_acc:.2%}")
print(f"\nModel Coefficients (first 5 features):")
for feature, coef in zip(cancer.feature_names[:5], log_reg.coef_[0][:5]):
    print(f"  {feature:30s}: {coef:8.3f}")

In [None]:
# Confusion matrix
cm = confusion_matrix(y_test, y_test_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
           xticklabels=['Malignant', 'Benign'],
           yticklabels=['Malignant', 'Benign'])
plt.xlabel('Predicted', fontsize=12)
plt.ylabel('Actual', fontsize=12)
plt.title('Confusion Matrix', fontsize=14)
plt.tight_layout()
plt.show()

# Rows and columns are ordered by label, so label 1 (benign) plays the "positive" role here
print("Confusion Matrix Interpretation (positive class = benign):")
print(f"  True Negatives (TN):  {cm[0,0]} (correctly predicted malignant)")
print(f"  False Positives (FP): {cm[0,1]} (malignant predicted as benign, i.e. a missed cancer)")
print(f"  False Negatives (FN): {cm[1,0]} (benign predicted as malignant)")
print(f"  True Positives (TP):  {cm[1,1]} (correctly predicted benign)")
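
Because benign is encoded as 1, the "positive" class here is the harmless one, which is easy to misread. The sketch below uses a tiny hand-made label set (not the notebook's test split) to show how the four counts unpack and how to ask for recall on the malignant class instead via `pos_label=0`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Toy labels using this notebook's encoding: 0 = malignant, 1 = benign
y_true_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred_toy = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1, 0])

# Rows are actual classes, columns are predicted classes, ordered by label
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")

# Recall for the malignant class: share of malignant tumors that were caught
malignant_recall = recall_score(y_true_toy, y_pred_toy, pos_label=0)
print(f"Malignant recall: {malignant_recall:.2f}")
```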

In [None]:
# Detailed classification report
print("Detailed Classification Report:")
print(classification_report(y_test, y_test_pred, 
                          target_names=['Malignant', 'Benign']))

## 5. Decision Boundaries

The **decision boundary** is where the model switches predictions (P=0.5).

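For logistic regression, this boundary is **linear**: P = 0.5 exactly when z = β₀ + β₁x₁ + β₂x₂ + ... = 0, which describes a straight line in 2D, a plane in 3D, and a hyperplane in higher dimensions.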
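
To see where the straight line comes from, the sketch below solves z = 0 for the second feature. The coefficients `b0`, `b1`, `b2` are made-up values for illustration, not the ones fitted in this notebook, and the helper names are our own.

```python
import math

# Hypothetical coefficients for two scaled features (not from a fitted model)
b0, b1, b2 = 0.5, -3.0, -1.0

def p_class1(x1: float, x2: float) -> float:
    """P(y=1) under the hypothetical model."""
    z = b0 + b1 * x1 + b2 * x2
    return 1.0 / (1.0 + math.exp(-z))

def boundary_x2(x1: float) -> float:
    """Solve b0 + b1*x1 + b2*x2 = 0 for x2 (requires b2 != 0)."""
    return -(b0 + b1 * x1) / b2

# Every point on the line x2 = boundary_x2(x1) has P(y=1) = 0.5
for x1 in (-2.0, 0.0, 2.0):
    x2 = boundary_x2(x1)
    print(f"x1={x1:+.1f}  x2={x2:+.2f}  P(y=1)={p_class1(x1, x2):.3f}")
```

With a fitted two-feature scikit-learn model, the same algebra would use `intercept_[0]` for `b0` and the two entries of `coef_[0]` for `b1` and `b2`.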

In [None]:
# Visualize decision boundary using first two features
# Train a simpler model with just 2 features for visualization
X_2d = X[:, :2]  # Use first 2 features only
X_train_2d, X_test_2d, y_train_2d, y_test_2d = train_test_split(
    X_2d, y, test_size=0.2, random_state=42, stratify=y
)

scaler_2d = StandardScaler()
X_train_2d_scaled = scaler_2d.fit_transform(X_train_2d)
X_test_2d_scaled = scaler_2d.transform(X_test_2d)

log_reg_2d = LogisticRegression(max_iter=10000)
log_reg_2d.fit(X_train_2d_scaled, y_train_2d)

# Create mesh for decision boundary
x_min, x_max = X_train_2d_scaled[:, 0].min() - 1, X_train_2d_scaled[:, 0].max() + 1
y_min, y_max = X_train_2d_scaled[:, 1].min() - 1, X_train_2d_scaled[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                    np.linspace(y_min, y_max, 200))

Z = log_reg_2d.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
Z = Z.reshape(xx.shape)

# Plot
plt.figure(figsize=(12, 6))

# Contour plot showing probabilities
contour = plt.contourf(xx, yy, Z, levels=20, cmap='RdYlBu', alpha=0.7)
plt.colorbar(contour, label='P(Benign)')

# Decision boundary (P=0.5)
plt.contour(xx, yy, Z, levels=[0.5], colors='black', linewidths=2)

# Plot data points
for diagnosis, name, color in [(0, 'Malignant', 'red'), (1, 'Benign', 'blue')]:
    mask = y_train_2d == diagnosis
    plt.scatter(X_train_2d_scaled[mask, 0], X_train_2d_scaled[mask, 1],
              c=color, label=name, s=30, alpha=0.7, edgecolors='k')

plt.xlabel(f'{cancer.feature_names[0]} (scaled)', fontsize=12)
plt.ylabel(f'{cancer.feature_names[1]} (scaled)', fontsize=12)
plt.title('Logistic Regression Decision Boundary', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("Key Observations:")
print("  • Black line = decision boundary (P = 0.5)")
print("  • Blue region = model predicts Benign (P > 0.5)")
print("  • Red region = model predicts Malignant (P < 0.5)")
print("  • Boundary is LINEAR (straight line)")

## 6. Probability Interpretation

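Logistic regression outputs **probability estimates**, not just class predictions. These estimates are often reasonably well calibrated, but calibration is not guaranteed: regularization and a poorly fitting model can both distort them.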
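
How trustworthy the probabilities are can be checked rather than assumed. The following is a minimal sketch, not part of the original notebook flow: it refits a model in a pipeline on the same dataset (variable names with the `_cal` suffix are our own) and reports the Brier score plus a coarse reliability table from `sklearn.calibration.calibration_curve`.

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_cal, y_cal = load_breast_cancer(return_X_y=True)
X_tr_cal, X_te_cal, y_tr_cal, y_te_cal = train_test_split(
    X_cal, y_cal, test_size=0.2, random_state=42, stratify=y_cal
)

model_cal = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))
model_cal.fit(X_tr_cal, y_tr_cal)
proba_cal = model_cal.predict_proba(X_te_cal)[:, 1]  # P(benign)

# Brier score: mean squared gap between predicted probability and outcome
# (0 is perfect, lower is better)
brier = brier_score_loss(y_te_cal, proba_cal)
print(f"Brier score: {brier:.4f}")

# Per bin: mean predicted probability vs observed fraction of benign cases.
# Bins that receive no samples are dropped, so fewer than n_bins rows may appear.
frac_pos, mean_pred = calibration_curve(y_te_cal, proba_cal, n_bins=5)
for mp, fpos in zip(mean_pred, frac_pos):
    print(f"  mean predicted = {mp:.2f}   observed benign fraction = {fpos:.2f}")
```

For a well-calibrated model the two columns track each other. With a test set this small, the per-bin fractions are noisy, so treat the table as a rough check.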

In [None]:
# Show probability predictions for first 10 test samples
sample_df = pd.DataFrame({
    'Actual': y_test[:10],
    'Predicted': y_test_pred[:10],
    'P(Benign)': y_test_proba[:10],
    'P(Malignant)': 1 - y_test_proba[:10],
    'Correct': y_test[:10] == y_test_pred[:10]
})

print("Sample Predictions with Probabilities:")
print(sample_df.to_string(index=False))

print("\nNote: Probabilities near 0 or 1 = more confident prediction; near 0.5 = uncertain")

In [None]:
# Visualize probability distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram of predicted probabilities
axes[0].hist(y_test_proba[y_test == 0], bins=20, alpha=0.7, 
            label='Actual Malignant', color='red', edgecolor='k')
axes[0].hist(y_test_proba[y_test == 1], bins=20, alpha=0.7, 
            label='Actual Benign', color='blue', edgecolor='k')
axes[0].axvline(x=0.5, color='black', linestyle='--', linewidth=2, label='Threshold')
axes[0].set_xlabel('Predicted Probability of Benign', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].set_title('Distribution of Predicted Probabilities', fontsize=14)
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Scatter plot: actual vs probability
jitter = np.random.normal(0, 0.02, len(y_test))
axes[1].scatter(y_test_proba, y_test + jitter, alpha=0.5, edgecolors='k')
axes[1].axvline(x=0.5, color='r', linestyle='--', linewidth=2, label='Threshold')
axes[1].set_xlabel('Predicted Probability of Benign', fontsize=12)
axes[1].set_ylabel('Actual Class (0=Malignant, 1=Benign)', fontsize=12)
axes[1].set_title('Actual Class vs Predicted Probability', fontsize=14)
axes[1].set_yticks([0, 1])
axes[1].set_yticklabels(['Malignant', 'Benign'])
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Good separation → Confident predictions (close to 0 or 1)")

## 7. Multi-Class Classification

Logistic regression is naturally binary, but can handle multiple classes using:

### 1. One-vs-Rest (OvR) / One-vs-All
- Train N binary classifiers (one per class)
- Each classifier: "this class vs all others"
- Prediction: class with highest probability
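- **Not the default in recent scikit-learn**: with the default solver, `LogisticRegression` fits a single multinomial (softmax) model on multi-class data, so OvR has to be requested explicitly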

### 2. One-vs-One (OvO)
- Train N(N-1)/2 binary classifiers
- Each classifier: "class A vs class B"
- Prediction: class that wins most pairwise comparisons
- More classifiers, but each trains on a smaller subset (only the samples from its two classes)
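
The classifier counts above can be confirmed with scikit-learn's generic wrappers. This is a sketch under our own choices: it uses the 10-class digits dataset (with only 3 classes, as in Iris, both strategies happen to train 3 models), and all variable names ending in `_dig` are illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_dig, y_dig = load_digits(return_X_y=True)  # 10 classes
X_tr_dig, X_te_dig, y_tr_dig, y_te_dig = train_test_split(
    X_dig, y_dig, test_size=0.2, random_state=42, stratify=y_dig
)

base = LogisticRegression(max_iter=1000)  # the wrappers clone this estimator
ovr_dig = make_pipeline(StandardScaler(), OneVsRestClassifier(base)).fit(X_tr_dig, y_tr_dig)
ovo_dig = make_pipeline(StandardScaler(), OneVsOneClassifier(base)).fit(X_tr_dig, y_tr_dig)

# The last pipeline step is the wrapper; estimators_ holds the fitted binary models
n_ovr = len(ovr_dig[-1].estimators_)
n_ovo = len(ovo_dig[-1].estimators_)
acc_ovr_dig = ovr_dig.score(X_te_dig, y_te_dig)
acc_ovo_dig = ovo_dig.score(X_te_dig, y_te_dig)

print(f"OvR: {n_ovr} binary classifiers, test accuracy {acc_ovr_dig:.2%}")
print(f"OvO: {n_ovo} binary classifiers, test accuracy {acc_ovo_dig:.2%}")
```

With N = 10, OvR trains N = 10 models and OvO trains N(N-1)/2 = 45.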

In [None]:
# Multi-class example: Iris dataset (3 classes)
iris = datasets.load_iris()
X_iris = iris.data
y_iris = iris.target

# Split and scale
X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42, stratify=y_iris
)

scaler_iris = StandardScaler()
X_train_iris_scaled = scaler_iris.fit_transform(X_train_iris)
X_test_iris_scaled = scaler_iris.transform(X_test_iris)

print("Iris Dataset for Multi-Class Classification:")
print(f"Classes: {iris.target_names}")
print(f"Training samples: {X_train_iris.shape[0]}")
print(f"Test samples: {X_test_iris.shape[0]}")

In [None]:
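# Train with One-vs-Rest, requested explicitly via the generic wrapper.
# (LogisticRegression's multi_class argument is deprecated since scikit-learn 1.5.)
from sklearn.multiclass import OneVsRestClassifier

log_reg_ovr = OneVsRestClassifier(LogisticRegression(max_iter=10000, random_state=42))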
log_reg_ovr.fit(X_train_iris_scaled, y_train_iris)
y_pred_ovr = log_reg_ovr.predict(X_test_iris_scaled)
acc_ovr = accuracy_score(y_test_iris, y_pred_ovr)

print("One-vs-Rest (OvR) Results:")
print(f"Accuracy: {acc_ovr:.2%}")
print(f"\nClassification Report:")
print(classification_report(y_test_iris, y_pred_ovr, target_names=iris.target_names))

In [None]:
# Confusion matrix for multi-class
cm_iris = confusion_matrix(y_test_iris, y_pred_ovr)

plt.figure(figsize=(8, 6))
sns.heatmap(cm_iris, annot=True, fmt='d', cmap='Blues',
           xticklabels=iris.target_names,
           yticklabels=iris.target_names)
plt.xlabel('Predicted', fontsize=12)
plt.ylabel('Actual', fontsize=12)
plt.title('Multi-Class Confusion Matrix (Iris)', fontsize=14)
plt.tight_layout()
plt.show()

print("Diagonal = correct predictions")
print("Off-diagonal = misclassifications")

In [None]:
# Show probability predictions for multi-class
y_proba_iris = log_reg_ovr.predict_proba(X_test_iris_scaled)

print("Multi-Class Probability Predictions (first 5 samples):")
proba_df = pd.DataFrame(
    y_proba_iris[:5],
    columns=[f'P({name})' for name in iris.target_names]
)
proba_df['Predicted'] = [iris.target_names[i] for i in y_pred_ovr[:5]]
proba_df['Actual'] = [iris.target_names[i] for i in y_test_iris[:5]]
print(proba_df.to_string(index=False))

print("\nNote: Sum of probabilities for each row = 1.0")
print("Predicted class = highest probability")

## 8. Practice Exercises

### Exercise 1: Adjust Decision Threshold

Using the breast cancer model:
1. Change the decision threshold from 0.5 to 0.3
2. How does this affect predictions?
3. Which errors increase/decrease (false positives vs false negatives)?

In [None]:
# Your code here


### Exercise 2: Feature Importance

For the breast cancer model:
1. Find the 5 features with largest coefficient magnitudes
2. What do these coefficients tell you about their importance?
3. Train a model using only these 5 features - how does accuracy compare?

In [None]:
# Your code here


### Exercise 3: Wine Dataset Classification

Load the wine dataset (`datasets.load_wine()`):
1. Train a multi-class logistic regression model
2. Calculate accuracy and create confusion matrix
3. Which classes are most confused with each other?

In [None]:
# Your code here


### Exercise 4: Coefficient Interpretation

Create a simple logistic regression with one feature.
If the coefficient is 2.5:
1. What happens to log-odds when feature increases by 1?
2. What happens to odds?
3. Demonstrate with actual predictions

In [None]:
# Your code here


## 9. Summary

### Key Concepts Learned

1. **Logistic Regression Fundamentals**:
   - Binary classification algorithm
   - Uses sigmoid function: σ(z) = 1/(1+e^(-z))
   - Outputs probabilities between 0 and 1
   - Linear decision boundary

2. **Sigmoid Function**:
   - Maps any real number to (0, 1)
   - S-shaped curve
   - Default threshold of 0.5 for binary decisions

3. **Probability Concepts**:
   - **Probability**: P ∈ [0, 1]
   - **Odds**: P/(1-P) ∈ [0, ∞)
   - **Log-Odds**: log(P/(1-P)) ∈ (-∞, ∞)
   - Logistic regression models log-odds linearly

4. **Classification Metrics**:
   - Accuracy: overall correctness
   - Confusion matrix: detailed error breakdown
   - Precision, recall, F1-score (from classification_report)

5. **Multi-Class Classification**:
   - **One-vs-Rest (OvR)**: N binary classifiers
   - **One-vs-One (OvO)**: N(N-1)/2 pairwise classifiers
   - scikit-learn's `LogisticRegression` defaults to a multinomial (softmax) fit; OvR must be requested explicitly

6. **Model Outputs**:
   - `predict()`: Class labels (0/1 or category)
   - `predict_proba()`: Probability estimates
   - `decision_function()`: Raw scores (before sigmoid)
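
For a binary model, the three outputs listed under Model Outputs are tightly linked. A minimal sketch, assuming synthetic data from `make_classification` and illustrative variable names:

```python
import numpy as np
from scipy.special import expit
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
clf_demo = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

scores = clf_demo.decision_function(X_demo)      # raw log-odds, one per sample
proba_pos = clf_demo.predict_proba(X_demo)[:, 1]  # P(y=1)
labels = clf_demo.predict(X_demo)                # hard class labels

# Applying the sigmoid to the raw score reproduces predict_proba
same_proba = np.allclose(expit(scores), proba_pos)

# predict() is the raw score thresholded at 0, i.e. probability thresholded at 0.5
same_labels = np.array_equal(labels, (scores > 0).astype(int))

print(f"sigmoid(decision_function) == predict_proba: {same_proba}")
print(f"(decision_function > 0) == predict:          {same_labels}")
```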

### When to Use Logistic Regression

✅ **Good for**:
- Binary classification (spam/not spam, fraud/legitimate)
- Multi-class classification (3+ categories)
- When you need probability estimates
- Interpretable models (on standardized features, a coefficient's sign and magnitude indicate the direction and strength of that feature's effect)
- Baseline classification model
- Classes that are roughly linearly separable

❌ **Not ideal for**:
- Non-linear decision boundaries (use kernel methods or trees)
- Highly imbalanced datasets (adjust class_weight)
- Many correlated features (regularization needed)

### Quick Reference: Scikit-learn

```python
from sklearn.linear_model import LogisticRegression

# Binary classification
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)  # Class labels
y_proba = model.predict_proba(X_test)  # Probabilities

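# Multi-class: LogisticRegression alone fits a multinomial (softmax) model;
# wrap it to get One-vs-Rest
from sklearn.multiclass import OneVsRestClassifier
model_multi = OneVsRestClassifier(LogisticRegression(max_iter=10000))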

# Important parameters:
# - C: Inverse of regularization (higher = less regularization)
# - penalty: 'l1', 'l2', 'elasticnet', None
# - solver: 'lbfgs', 'liblinear', 'saga'
# - class_weight: 'balanced' for imbalanced data
```
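
To get a feel for the `C` parameter, the sketch below fits the same model with three values of `C` and compares the size of the coefficient vector. It is an illustration under our own choices (the values of `C`, the `_c` variable names), and it scales the full dataset because only the coefficients are inspected, not test performance.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_c, y_c = load_breast_cancer(return_X_y=True)
X_c = StandardScaler().fit_transform(X_c)

coef_norms = {}
for C in (0.01, 1.0, 100.0):
    clf_c = LogisticRegression(C=C, max_iter=10000).fit(X_c, y_c)
    # L2 norm of the coefficients: smaller means stronger shrinkage
    coef_norms[C] = float(np.linalg.norm(clf_c.coef_))
    print(f"C={C:<6}  ||coef|| = {coef_norms[C]:.3f}")
```

Smaller `C` means stronger regularization, so the coefficients shrink toward zero; larger `C` lets them grow.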

### Comparison: Linear vs Logistic Regression

| Aspect | Linear Regression | Logistic Regression |
|--------|------------------|--------------------|
| **Task** | Regression | Classification |
| **Output** | Continuous | Probability (0-1) |
| **Equation** | y = β₀ + β₁x | P = 1/(1+e^(-z)), where z = β₀ + β₁x |
| **Loss** | MSE | Log-loss |
| **Use Case** | Predict prices | Predict categories |
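
The log-loss row can be unpacked with a few lines of plain Python. The helper `log_loss_binary` and the toy numbers below are our own illustration, not scikit-learn's implementation. They show that a single confidently wrong probability is penalized heavily.

```python
import math

def log_loss_binary(y_true, p_pred):
    """Mean negative log-likelihood for binary labels (illustrative helper)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Only one of the two terms is non-zero for each sample
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

y_toy = [1, 0, 1, 1]
mostly_right = [0.9, 0.1, 0.8, 0.95]
one_confident_miss = [0.9, 0.1, 0.8, 0.05]  # last sample: predicts 5% for a true 1

loss_right = log_loss_binary(y_toy, mostly_right)
loss_miss = log_loss_binary(y_toy, one_confident_miss)
print(f"log-loss, mostly right:       {loss_right:.3f}")
print(f"log-loss, one confident miss: {loss_miss:.3f}")

# A coin-flip prediction of 0.5 always costs log(2) per sample
print(f"log-loss at p = 0.5:          {log_loss_binary([1], [0.5]):.3f}")
```

Unlike MSE, this loss grows without bound as a predicted probability for the true class approaches 0, which is what pushes the model away from overconfident mistakes.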

### Next Steps

In the next module, we'll explore:
- **Decision Trees** for non-linear boundaries
- Tree visualization and interpretation
- Handling overfitting with pruning
- Feature importance analysis

### Additional Resources

- [Scikit-learn Logistic Regression](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression)
- [StatQuest: Logistic Regression](https://www.youtube.com/watch?v=yIYKR4sgzI8)
- [Andrew Ng: Logistic Regression](https://www.coursera.org/learn/machine-learning)
- [Understanding the Sigmoid Function](https://towardsdatascience.com/understanding-the-sigmoid-function-f0e6e0a7eca)