# Module 04: Logistic Regression

**Difficulty**: ⭐ Beginner  
**Estimated Time**: 60 minutes  
**Prerequisites**: 
- [Module 00: Introduction to ML and scikit-learn](00_introduction_to_ml_and_sklearn.ipynb)
- [Module 01: Supervised vs Unsupervised Learning](01_supervised_vs_unsupervised_learning.ipynb)
- [Module 02: Data Preparation and Train/Test Split](02_data_preparation_train_test_split.ipynb)
- [Module 03: Linear Regression](03_linear_regression.ipynb)

## Learning Objectives

By the end of this notebook, you will be able to:
1. Understand how logistic regression adapts linear regression for classification
2. Explain the sigmoid function and its role in converting linear output to probabilities
3. Build binary classification models
4. Build multiclass classification models using the One-vs-Rest strategy
5. Interpret probability predictions and decision thresholds
6. Evaluate classification models using accuracy and other metrics

## 1. From Linear to Logistic Regression

### The Problem with Linear Regression for Classification

Linear regression predicts continuous values, but classification needs discrete categories:
- Email: Spam (1) or Not Spam (0)
- Disease: Present (1) or Absent (0)
- Transaction: Fraud (1) or Legitimate (0)

**Problem**: Linear regression can predict values like 1.5, -0.3, or 100, which don't make sense for categories!
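To make the problem concrete, here is a minimal sketch that fits ordinary linear regression to 0/1 labels. The toy data and the names `X_toy`, `y_toy`, and `X_new` are invented for illustration and are not part of the course datasets.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: one feature, binary label (0 or 1)
X_toy = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

lin_reg = LinearRegression().fit(X_toy, y_toy)

# Predict for inputs below, inside, and above the training range
X_new = np.array([[-5.0], [3.5], [15.0]])
preds = lin_reg.predict(X_new)

for x, p in zip(X_new.ravel(), preds):
    print(f"x = {x:5.1f} -> predicted value = {p:.2f}")
```

The first prediction falls below 0 and the last rises above 1, so neither can be read as a class label or as a probability.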

### The Solution: Logistic Regression

**Logistic Regression** uses the **sigmoid function** to transform linear regression output into probabilities between 0 and 1.

### The Sigmoid Function

```
σ(z) = 1 / (1 + e^(-z))
```

Where:
- **z** = linear combination: β₀ + β₁x₁ + β₂x₂ + ...
- **σ(z)** = probability between 0 and 1

**Key Properties**:
- Output always between 0 and 1
- S-shaped curve
- When z = 0, σ(z) = 0.5
- As z → ∞, σ(z) → 1
- As z → -∞, σ(z) → 0
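The formula can be traced by hand for a single sample. The sketch below uses made-up coefficients (`beta_0`, `betas`) and a made-up feature vector `x`; none of these values come from a fitted model.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients and one hypothetical sample (illustration only)
beta_0 = -1.0
betas = np.array([0.8, -0.5])
x = np.array([2.0, 1.0])

# Linear combination: z = β₀ + β₁x₁ + β₂x₂ = -1.0 + 1.6 - 0.5 = 0.1
z = beta_0 + betas @ x
p = sigmoid(z)
print(f"z = {z:.2f}, sigmoid(z) = {p:.4f}")

# Symmetry: σ(-z) = 1 - σ(z)
print(np.isclose(sigmoid(-z), 1 - p))

# The inverse of the sigmoid is the log-odds, which recovers z
log_odds = np.log(p / (1 - p))
print(np.isclose(log_odds, z))
```

Because z is slightly above 0, the resulting probability is slightly above 0.5, which matches the property that σ(0) = 0.5.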

In [None]:
# Setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Visualization settings
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

print("✓ Setup complete!")

In [None]:
# Visualize the sigmoid function
def sigmoid(z):
    """Calculate sigmoid function"""
    return 1 / (1 + np.exp(-z))

# Create range of z values
z_values = np.linspace(-10, 10, 200)
sigmoid_values = sigmoid(z_values)

plt.figure(figsize=(10, 6))
plt.plot(z_values, sigmoid_values, linewidth=3, color='blue')
plt.axhline(y=0.5, color='red', linestyle='--', linewidth=2, label='Decision Threshold (0.5)')
plt.axvline(x=0, color='green', linestyle='--', linewidth=2, alpha=0.5)
plt.xlabel('z (linear combination)', fontsize=12)
plt.ylabel('σ(z) - Probability', fontsize=12)
plt.title('The Sigmoid Function\nMaps any real value to a probability in (0, 1)',
         fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.legend(fontsize=10)
plt.ylim(-0.1, 1.1)
plt.tight_layout()
plt.show()

print("Key Properties:")
print(f"- When z = -5: σ(z) = {sigmoid(-5):.4f} (very unlikely)")
print(f"- When z = 0: σ(z) = {sigmoid(0):.4f} (neutral)")
print(f"- When z = 5: σ(z) = {sigmoid(5):.4f} (very likely)")

## 2. Binary Classification with Logistic Regression

Let's predict whether a tumor is malignant (cancerous) or benign using the breast cancer dataset.

**Target**:
- 0 = Malignant (cancerous)
- 1 = Benign (non-cancerous)

In [None]:
# Load breast cancer dataset
cancer_df = pd.read_csv('data/sample/breast_cancer.csv')

print("Breast Cancer Dataset Overview:")
print(f"Shape: {cancer_df.shape}")
print(f"\nTarget distribution:")
print(cancer_df['diagnosis'].value_counts())
print(f"\nClass balance:")
print(cancer_df['diagnosis'].value_counts(normalize=True))
print(f"\nFirst few rows:")
cancer_df.head()

In [None]:
# Prepare features and target
# Drop non-feature columns
X = cancer_df.drop(['target', 'diagnosis'], axis=1)
y = cancer_df['target']

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"\nNumber of features: {X.shape[1]}")
print(f"Feature names (first 5): {list(X.columns[:5])}")

In [None]:
# Split and scale the data
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split data (stratified to maintain class balance)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Scale features (important for logistic regression)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Data Preparation Complete:")
print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")
print(f"Features scaled: ✓")

In [None]:
# Train logistic regression model
from sklearn.linear_model import LogisticRegression

# Create and train the model
log_reg = LogisticRegression(random_state=42, max_iter=10000)
log_reg.fit(X_train_scaled, y_train)

print("✓ Logistic Regression model trained!")
print(f"\nNumber of coefficients: {len(log_reg.coef_[0])}")
print(f"Intercept: {log_reg.intercept_[0]:.4f}")

In [None]:
# Make predictions
y_pred = log_reg.predict(X_test_scaled)

# Calculate accuracy
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

accuracy = accuracy_score(y_test, y_pred)

print("Model Performance:")
print(f"Accuracy: {accuracy:.1%}")
print(f"\nThis means the model correctly classifies {accuracy:.1%} of the tumors in the test set.")

## 3. Probability Predictions

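Unlike classifiers that output only a hard class label, logistic regression provides a **probability estimate** for each class. This is very useful in real-world applications, where the confidence of a prediction often matters as much as the prediction itself.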
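For a binary model, these probabilities are the sigmoid applied to the model's linear score. The sketch below checks this on synthetic data from `make_classification`; the dataset and the names `X_demo`, `y_demo`, and `clf` are stand-ins, not the breast cancer model used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary dataset (assumption: used only for this check)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# decision_function returns z = intercept + coefficients · x
z_demo = clf.decision_function(X_demo[:5])

# Applying the sigmoid by hand should reproduce predict_proba for class 1
proba_manual = 1.0 / (1.0 + np.exp(-z_demo))
proba_sklearn = clf.predict_proba(X_demo[:5])[:, 1]

print(np.allclose(proba_manual, proba_sklearn))
```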

In [None]:
# Get probability predictions
y_proba = log_reg.predict_proba(X_test_scaled)

# Display first 10 predictions with probabilities
predictions_df = pd.DataFrame({
    'Actual': y_test.values[:10],
    'Predicted': y_pred[:10],
    'Prob_Malignant': y_proba[:10, 0],
    'Prob_Benign': y_proba[:10, 1],
    'Confidence': np.max(y_proba[:10], axis=1)
})

print("Prediction Examples with Probabilities:")
print(predictions_df.to_string(index=False))
print("\nNote: Confidence = probability of predicted class")
print("Higher confidence = more certain prediction")

In [None]:
# Visualize probability distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram of probabilities for class 1 (benign)
axes[0].hist(y_proba[y_test == 0, 1], bins=30, alpha=0.7, label='Actual: Malignant', color='red')
axes[0].hist(y_proba[y_test == 1, 1], bins=30, alpha=0.7, label='Actual: Benign', color='blue')
axes[0].axvline(x=0.5, color='black', linestyle='--', linewidth=2, label='Decision Threshold')
axes[0].set_xlabel('Predicted Probability (Benign)', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].set_title('Distribution of Predicted Probabilities', fontsize=13, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Confidence distribution
confidence = np.max(y_proba, axis=1)
axes[1].hist(confidence, bins=30, color='green', alpha=0.7, edgecolor='black')
axes[1].set_xlabel('Prediction Confidence', fontsize=12)
axes[1].set_ylabel('Frequency', fontsize=12)
axes[1].set_title('Model Confidence Distribution', fontsize=13, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Average confidence: {confidence.mean():.1%}")
print(f"Min confidence: {confidence.min():.1%}")
print(f"Max confidence: {confidence.max():.1%}")

## 4. Decision Threshold

By default, logistic regression uses a **threshold of 0.5**:
- If probability ≥ 0.5 → Predict class 1
- If probability < 0.5 → Predict class 0

But we can adjust this threshold based on the problem requirements!
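A small hand-checkable sketch shows the trade-off. The labels and probabilities below are invented, and `results` is a hypothetical helper dictionary used only to collect the counts.

```python
import numpy as np

# Hypothetical true labels and predicted probabilities of class 1
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
p_class1 = np.array([0.10, 0.35, 0.55, 0.65, 0.40, 0.60, 0.80, 0.95])

results = {}
for threshold in (0.3, 0.5, 0.7):
    # Predict class 1 whenever the probability reaches the threshold
    y_hat = (p_class1 >= threshold).astype(int)
    fp_count = int(np.sum((y_hat == 1) & (y_true == 0)))  # predicted 1, actually 0
    fn_count = int(np.sum((y_hat == 0) & (y_true == 1)))  # predicted 0, actually 1
    results[threshold] = (fp_count, fn_count)
    print(f"threshold = {threshold}: FP = {fp_count}, FN = {fn_count}")
```

On this toy data, raising the threshold from 0.3 to 0.7 lowers the false positives from 3 to 0 while raising the false negatives from 0 to 2.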

In [None]:
# Demonstrate different thresholds
thresholds = [0.3, 0.5, 0.7]

print("Effect of Different Decision Thresholds:")
print("="*60)

for threshold in thresholds:
    # Apply threshold manually
    y_pred_threshold = (y_proba[:, 1] >= threshold).astype(int)
    accuracy_threshold = accuracy_score(y_test, y_pred_threshold)
    
    # Count predictions
    n_positive = np.sum(y_pred_threshold == 1)
    n_negative = np.sum(y_pred_threshold == 0)
    
    print(f"\nThreshold: {threshold}")
    print(f"  Accuracy: {accuracy_threshold:.1%}")
    print(f"  Predicted as Benign (1): {n_positive}")
    print(f"  Predicted as Malignant (0): {n_negative}")

print("\nKey Insight:")
print("- Lower threshold → More predictions of class 1 (more sensitive)")
print("- Higher threshold → Fewer predictions of class 1 (more specific)")
print("- Choose threshold based on the cost of false positives vs false negatives")

## 5. Confusion Matrix

A confusion matrix shows the breakdown of correct and incorrect predictions:

```
                 Predicted
              Negative  Positive
Actual Negative   TN       FP
       Positive   FN       TP
```

Where:
- **TN** (True Negative): Correctly predicted negative
- **TP** (True Positive): Correctly predicted positive
- **FN** (False Negative): Incorrectly predicted negative (missed)
- **FP** (False Positive): Incorrectly predicted positive (false alarm)
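The layout can be verified on a toy example. The labels below are invented; note that scikit-learn orders rows and columns by sorted label, so with labels 0 and 1 the "positive" class is label 1.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 4 actual negatives (0) and 6 actual positives (1)
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]

cm_toy = confusion_matrix(y_true_toy, y_pred_toy)
print(cm_toy)

# ravel() flattens row by row: [[TN, FP], [FN, TP]] -> TN, FP, FN, TP
tn_toy, fp_toy, fn_toy, tp_toy = cm_toy.ravel()
print(f"TN = {tn_toy}, FP = {fp_toy}, FN = {fn_toy}, TP = {tp_toy}")

# Precision = TP / (TP + FP), Recall = TP / (TP + FN)
precision_toy = precision_score(y_true_toy, y_pred_toy)
recall_toy = recall_score(y_true_toy, y_pred_toy)
print(f"Precision = {precision_toy:.3f}, Recall = {recall_toy:.3f}")
```

Here TN = 3, FP = 1, FN = 2, and TP = 4, which gives a precision of 4/5 and a recall of 4/6.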

In [None]:
# Create confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Visualize confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=True,
           xticklabels=['Malignant', 'Benign'],
           yticklabels=['Malignant', 'Benign'])
plt.xlabel('Predicted', fontsize=12)
plt.ylabel('Actual', fontsize=12)
plt.title('Confusion Matrix\nBreakdown of Predictions', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Extract values
# scikit-learn sorts the labels, so class 0 (malignant) is "negative"
# and class 1 (benign) is "positive" in this matrix
tn, fp, fn, tp = cm.ravel()

print("Confusion Matrix Breakdown (positive class = 1 = benign):")
print(f"True Negatives (TN): {tn} - Malignant correctly identified as malignant")
print(f"True Positives (TP): {tp} - Benign correctly identified as benign")
print(f"False Negatives (FN): {fn} - Benign classified as malignant (unnecessary worry)")
print(f"False Positives (FP): {fp} - Malignant classified as benign (dangerous!)")

In [None]:
# Detailed classification report
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=['Malignant', 'Benign']))

print("\nMetrics Explanation:")
print("- Precision: Of all predicted positive, how many were correct?")
print("- Recall: Of all actual positive, how many did we find?")
print("- F1-Score: Harmonic mean of precision and recall")
print("- Support: Number of samples in each class")

## 6. Multiclass Classification

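Logistic regression can also handle **multiple classes**. One approach is the **One-vs-Rest (OvR)** strategy:
- Train one binary classifier for each class:
  - Class A vs (B and C)
  - Class B vs (A and C)
  - Class C vs (A and B)
- Choose the class with the highest probability

Note that scikit-learn's `LogisticRegression` fits a multinomial (softmax) model by default for multiclass targets, so OvR has to be requested explicitly.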

Let's classify Iris species (3 classes).
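To show what OvR does under the hood, the sketch below builds it by hand: one binary model per class, then an argmax over the class scores. It uses `load_iris` from scikit-learn as a stand-in for the CSV file, and it scores on the training data, so the accuracy is illustrative only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Stand-in data source (assumption: the notebook's CSV holds the same dataset)
X_ovr, y_ovr = load_iris(return_X_y=True)
X_ovr = StandardScaler().fit_transform(X_ovr)

# Train one "this class vs the rest" binary classifier per class
scores = []
for cls in np.unique(y_ovr):
    binary_target = (y_ovr == cls).astype(int)
    clf_k = LogisticRegression(max_iter=1000).fit(X_ovr, binary_target)
    scores.append(clf_k.predict_proba(X_ovr)[:, 1])  # P(class == cls)

scores = np.column_stack(scores)  # shape: (n_samples, n_classes)

# Predict the class whose binary classifier is most confident
ovr_pred = scores.argmax(axis=1)

# The raw per-class scores need not sum to 1, so normalize each row
ovr_proba = scores / scores.sum(axis=1, keepdims=True)

ovr_accuracy = (ovr_pred == y_ovr).mean()
print(f"Training accuracy of the hand-built OvR model: {ovr_accuracy:.1%}")
```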

In [None]:
# Load Iris dataset
iris_df = pd.read_csv('data/sample/iris.csv')

# Prepare features and target
feature_cols = ['sepal length (cm)', 'sepal width (cm)', 
                'petal length (cm)', 'petal width (cm)']
X_iris = iris_df[feature_cols]
y_iris = iris_df['species']

print("Iris Dataset - Multiclass Classification:")
print(f"Number of samples: {len(X_iris)}")
print(f"Number of features: {X_iris.shape[1]}")
print(f"Number of classes: {y_iris.nunique()}")
print(f"\nClass distribution:")
print(y_iris.value_counts().sort_index())

In [None]:
# Split and scale
X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=42, stratify=y_iris
)

scaler_iris = StandardScaler()
X_train_iris_scaled = scaler_iris.fit_transform(X_train_iris)
X_test_iris_scaled = scaler_iris.transform(X_test_iris)

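# Train multiclass logistic regression with an explicit One-vs-Rest wrapper
# (the multi_class='ovr' argument of LogisticRegression is deprecated since scikit-learn 1.5)
from sklearn.multiclass import OneVsRestClassifier

log_reg_multi = OneVsRestClassifier(LogisticRegression(random_state=42, max_iter=10000))
log_reg_multi.fit(X_train_iris_scaled, y_train_iris)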

print("✓ Multiclass Logistic Regression trained!")
print(f"\nNumber of classes: {len(log_reg_multi.classes_)}")
print(f"Classes: {log_reg_multi.classes_}")

In [None]:
# Make predictions
y_pred_iris = log_reg_multi.predict(X_test_iris_scaled)
y_proba_iris = log_reg_multi.predict_proba(X_test_iris_scaled)

# Evaluate
accuracy_iris = accuracy_score(y_test_iris, y_pred_iris)

print(f"Multiclass Classification Accuracy: {accuracy_iris:.1%}")

# Show example predictions with probabilities for all classes
multi_pred_df = pd.DataFrame({
    'Actual': y_test_iris.values[:8],
    'Predicted': y_pred_iris[:8],
    'Prob_Class_0': y_proba_iris[:8, 0],
    'Prob_Class_1': y_proba_iris[:8, 1],
    'Prob_Class_2': y_proba_iris[:8, 2]
})

print("\nExample Multiclass Predictions:")
print(multi_pred_df.to_string(index=False))
print("\nNote: Each sample has probability for all 3 classes (sum = 1)")

In [None]:
# Confusion matrix for multiclass
cm_iris = confusion_matrix(y_test_iris, y_pred_iris)

plt.figure(figsize=(8, 6))
sns.heatmap(cm_iris, annot=True, fmt='d', cmap='Greens', cbar=True,
           xticklabels=['Setosa', 'Versicolor', 'Virginica'],
           yticklabels=['Setosa', 'Versicolor', 'Virginica'])
plt.xlabel('Predicted', fontsize=12)
plt.ylabel('Actual', fontsize=12)
plt.title('Multiclass Confusion Matrix\nIris Species Classification', 
         fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("Classification Report:")
print(classification_report(y_test_iris, y_pred_iris, 
                          target_names=['Setosa', 'Versicolor', 'Virginica']))

## Exercises

Practice building and evaluating logistic regression models.

### Exercise 1: Binary Classification on the Wine Dataset

Using the wine dataset, create a binary classification problem:

Steps:
1. Load the wine dataset from 'data/sample/wine.csv'
2. Create a binary target: class 0 vs (class 1 and 2 combined)
3. Split data (70/30, stratified)
4. Scale features
5. Train a LogisticRegression model
6. Calculate and print accuracy
7. Display the confusion matrix

In [None]:
# Your code here


### Exercise 2: Probability Interpretation

Using the breast cancer model (log_reg) we trained:

1. Find the sample with the HIGHEST confidence prediction
2. Find the sample with the LOWEST confidence prediction
3. Print both samples with their probabilities
4. What do these confidence levels tell us?

Hint: Use y_proba and np.max() to find confidence scores.

In [None]:
# Your code here


### Exercise 3: Threshold Tuning

For the breast cancer model, experiment with different thresholds:

1. Try thresholds: [0.2, 0.4, 0.5, 0.6, 0.8]
2. For each threshold, calculate:
   - Number of false negatives (FN)
   - Number of false positives (FP)
3. Create a plot showing FN and FP vs threshold
4. Which threshold would you choose for cancer detection and why?

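Think: In cancer detection, which is worse - FN or FP? Remember that class 1 is benign in this dataset, so first work out which error type corresponds to a missed cancer.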

In [None]:
# Your code here


### Exercise 4: Feature Importance in Logistic Regression

Examine which features are most important for predicting breast cancer:

1. Get the coefficients from log_reg.coef_[0]
2. Create a DataFrame with feature names and coefficients
3. Sort by absolute value of coefficients
4. Visualize the top 10 most important features (horizontal bar plot)
5. Interpret: What do positive/negative coefficients mean?

In [None]:
# Your code here


## Summary

Congratulations! You've mastered logistic regression for classification tasks.

### Key Concepts

1. **Logistic Regression**:
   - Adapts linear regression for classification
   - Uses the sigmoid function to convert the linear output into a probability
   - Output always between 0 and 1
   - Default threshold: 0.5 for binary classification

2. **Sigmoid Function**:
   - Formula: σ(z) = 1 / (1 + e^(-z))
   - S-shaped curve
   - Maps any real value to a probability strictly between 0 and 1
   - Critical for converting linear output to probabilities

3. **Binary Classification**:
   - Two classes: 0 and 1
   - Predicts probability of class 1
   - Threshold determines final class
   - Examples: spam detection, disease diagnosis, fraud detection

4. **Probability Predictions**:
   - predict_proba() returns probabilities for each class
   - Useful for risk assessment and ranking
   - Confidence = max probability
   - Can adjust threshold based on cost of errors

5. **Multiclass Classification**:
   - One-vs-Rest (OvR) strategy
   - One binary classifier per class
   - Choose class with highest probability
   - Probabilities sum to 1 across all classes

6. **Evaluation Metrics**:
   - **Accuracy**: Overall correctness
   - **Confusion Matrix**: Breakdown of predictions (TP, TN, FP, FN)
   - **Precision**: Of predicted positive, how many correct?
   - **Recall**: Of actual positive, how many found?
   - **F1-Score**: Balance of precision and recall

7. **Best Practices**:
   - Scale features before fitting: it helps the solver converge and lets the default L2 regularization treat all features evenly
   - Use stratified split for classification
   - Examine probability distributions
   - Consider cost of FP vs FN when choosing threshold
   - Use confusion matrix to understand errors
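One way to combine the scaling and stratified-split practices is a scikit-learn `Pipeline`, which fits the scaler on the training data only. This is a sketch under the assumption that `load_breast_cancer` is an acceptable stand-in for the CSV used earlier; the names `pipe`, `X_bc`, and `y_bc` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data source (assumption: same dataset as the notebook's CSV)
X_bc, y_bc = load_breast_cancer(return_X_y=True)

# Stratified split keeps the class proportions equal in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_bc, y_bc, test_size=0.3, random_state=42, stratify=y_bc
)

# The pipeline fits the scaler on training data only, then the classifier
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))
pipe.fit(X_tr, y_tr)

# score() applies the already-fitted scaler to the test data before predicting
test_acc = pipe.score(X_te, y_te)
print(f"Test accuracy: {test_acc:.1%}")
```

Bundling the steps this way avoids accidentally fitting the scaler on test data, and the whole pipeline can be passed to cross-validation utilities as a single estimator.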

### When to Use Logistic Regression

**Good for:**
- Binary or multiclass classification
- Need probability estimates
- Interpretable models (on scaled features, the sign and size of each coefficient indicate that feature's influence)
- Classes that can be separated reasonably well by a linear boundary
- Baseline classification models

**Not good for:**
- Non-linear decision boundaries (use SVM with kernels or neural networks)
- Very large feature spaces without regularization (many features relative to samples invite overfitting)
- Regression problems (use linear regression instead)
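The non-linear boundary limitation can be seen on synthetic concentric circles. The sketch below uses `make_circles` as an illustrative dataset and scores on the training data. Adding squared features is shown only as one possible workaround, not as a recommendation from this module.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression

# Two classes arranged as an inner and an outer circle (not linearly separable)
X_circ, y_circ = make_circles(n_samples=400, noise=0.05, factor=0.5, random_state=0)

# A straight-line boundary cannot separate the inner circle from the outer one
linear_acc = LogisticRegression().fit(X_circ, y_circ).score(X_circ, y_circ)

# Adding squared features lets the same linear model express a circular boundary
X_circ_sq = np.hstack([X_circ, X_circ ** 2])
squared_acc = LogisticRegression().fit(X_circ_sq, y_circ).score(X_circ_sq, y_circ)

print(f"Accuracy with raw features:     {linear_acc:.1%}")
print(f"Accuracy with squared features: {squared_acc:.1%}")
```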

### What's Next?

In **Module 05: Decision Trees**, you'll learn:
- How decision trees make classifications
- Understanding tree depth and complexity
- Visualizing decision trees
- Feature importance in tree-based models
- Advantages and disadvantages of trees

### Additional Resources

- [Logistic Regression - StatQuest](https://www.youtube.com/watch?v=yIYKR4sgzI8)
- [scikit-learn Logistic Regression](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression)
- [Understanding the Sigmoid Function](https://towardsdatascience.com/derivative-of-the-sigmoid-function-536880cf918e)