# Logistic Regression - Complete Guide

## 📚 Learning Objectives
- Understand logistic regression for binary and multi-class classification
- Implement logistic regression with scikit-learn
- Interpret coefficients and odds ratios
- Handle class imbalance
- Evaluate classification performance
- Apply regularization techniques

## 🎯 What is Logistic Regression?

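**Logistic Regression** is a statistical method that models the probability of a categorical outcome. In its basic form the outcome is binary (two classes); the multinomial extension covered in Part 2 handles three or more classes.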

### Key Concepts:
- Uses **sigmoid function** to map predictions to probabilities (0 to 1)
- Despite the name, it's a **classification** algorithm, not regression
- Linear decision boundary
- Probabilistic interpretation

### Sigmoid Function:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Where: $z = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n$
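
To make the formula concrete, the short sketch below computes the linear score $z$ and passes it through the sigmoid. The intercept, coefficients, and feature values are made up purely for illustration; they do not come from any fitted model in this guide.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical two-feature model (illustrative values only)
beta_0 = -1.0                      # intercept
beta = np.array([0.8, -0.5])       # coefficients beta_1, beta_2
x = np.array([2.0, 1.0])           # one sample with features x_1, x_2

z = beta_0 + beta @ x              # linear score: -1.0 + 1.6 - 0.5 = 0.1
p = sigmoid(z)                     # probability of the positive class

print(f"z = {z:.2f}, P(y=1 | x) = {p:.4f}")
```

A score of exactly 0 maps to a probability of 0.5, which is why the default decision threshold of 0.5 corresponds to the linear boundary $z = 0$.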

### When to Use:
✅ Binary classification (Yes/No, 0/1)  
✅ Need probability estimates  
✅ Interpretable model required  
✅ Linear decision boundary acceptable  
✅ Baseline model for comparison  

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_auc_score, roc_curve
)
from sklearn.datasets import load_breast_cancer, make_classification
from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings('ignore')

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("✅ Libraries imported successfully!")

## Part 1: Binary Classification
### 1️⃣ Load and Explore Data

In [None]:
# Load breast cancer dataset
cancer = load_breast_cancer()
X = pd.DataFrame(cancer.data, columns=cancer.feature_names)
y = pd.Series(cancer.target, name='diagnosis')

print(f"Dataset shape: {X.shape}")
print(f"\nTarget classes: {cancer.target_names}")
print(f"\nClass distribution:")
print(y.value_counts())
print(f"\nClass balance: {y.value_counts(normalize=True)}")

# Display first few rows
df = pd.concat([X, y], axis=1)
df.head()

In [None]:
# Visualize class distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Class distribution
# value_counts() orders by frequency, so sort by class label first;
# otherwise the bars would not line up with cancer.target_names.
y.value_counts().sort_index().plot(kind='bar', ax=axes[0], color=['salmon', 'skyblue'], edgecolor='black')
axes[0].set_title('Class Distribution', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Class')
axes[0].set_ylabel('Count')
axes[0].set_xticklabels(cancer.target_names, rotation=0)

# Feature correlation with target
correlations = X.corrwith(y).abs().sort_values(ascending=False).head(10)
correlations.plot(kind='barh', ax=axes[1], color='lightgreen', edgecolor='black')
axes[1].set_title('Top 10 Features Correlated with Target', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Absolute Correlation')

plt.tight_layout()
plt.show()

### 2️⃣ Train-Test Split and Scaling

In [None]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
print(f"\nTraining class distribution:")
print(y_train.value_counts())
print(f"\nTest class distribution:")
print(y_test.value_counts())
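
The cell above only splits the data; the scaling itself happens inside the pipeline built in the next section. As a rough sketch of what that pipeline step does, the example below fits a `StandardScaler` on the training rows only and reuses those statistics for the test rows, which avoids leaking test-set information. The `_demo` and `_tr`/`_te` variable names are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)   # learn mean/std from training rows only
X_te_scaled = scaler.transform(X_te)       # reuse the training statistics

# Training features are centered at 0; test features are only approximately so
print("Train mean (first 3 features):", X_tr_scaled.mean(axis=0)[:3].round(3))
print("Test mean  (first 3 features):", X_te_scaled.mean(axis=0)[:3].round(3))
```

Wrapping the scaler and classifier in a `Pipeline` performs the same fit-on-train, transform-on-test sequence automatically, including inside cross-validation.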

### 3️⃣ Basic Logistic Regression Model

In [None]:
# Create pipeline with scaling
logreg_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(random_state=42, max_iter=10000))
])

# Train model
print("Training Logistic Regression model...")
logreg_pipeline.fit(X_train, y_train)
print("✅ Training complete!")

# Make predictions
y_pred = logreg_pipeline.predict(X_test)
y_pred_proba = logreg_pipeline.predict_proba(X_test)[:, 1]

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)

print(f"\nüìä Model Performance:")
print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-Score:  {f1:.4f}")
print(f"ROC-AUC:   {roc_auc:.4f}")

print(f"\nüìã Classification Report:")
print(classification_report(y_test, y_pred, target_names=cancer.target_names))
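
A single train/test split gives only one estimate of performance. The sketch below shows one way to get a more stable estimate with `cross_val_score`, which is imported at the top of this guide. The choice of 5 stratified folds and ROC-AUC as the score is an assumption made for this example, not something the original analysis prescribes.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_cv, y_cv = load_breast_cancer(return_X_y=True)

# Scaling lives inside the pipeline, so it is refit on each training fold
cv_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=10000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(cv_pipeline, X_cv, y_cv, cv=cv, scoring='roc_auc')

print("ROC-AUC per fold:", cv_scores.round(4))
print(f"Mean ROC-AUC: {cv_scores.mean():.4f} (std {cv_scores.std():.4f})")
```

The spread across folds gives a sense of how much the single-split numbers above could vary with a different random split.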

### 4️⃣ Confusion Matrix and ROC Curve

In [None]:
# Create visualizations
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0],
            xticklabels=cancer.target_names,
            yticklabels=cancer.target_names,
            cbar_kws={'label': 'Count'})
axes[0].set_xlabel('Predicted Label', fontsize=12)
axes[0].set_ylabel('True Label', fontsize=12)
axes[0].set_title('Confusion Matrix', fontsize=14, fontweight='bold')

# ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
axes[1].plot(fpr, tpr, linewidth=2, label=f'ROC Curve (AUC = {roc_auc:.3f})')
axes[1].plot([0, 1], [0, 1], 'k--', linewidth=2, label='Random Classifier')
axes[1].set_xlabel('False Positive Rate', fontsize=12)
axes[1].set_ylabel('True Positive Rate', fontsize=12)
axes[1].set_title('ROC Curve', fontsize=14, fontweight='bold')
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
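
Each point on the ROC curve corresponds to a different decision threshold, so the curve can also be used to pick a threshold other than 0.5. The sketch below uses Youden's J statistic (TPR minus FPR) as the selection rule, which is one common heuristic among several and an assumption of this example. To avoid tuning on the test set, the threshold is chosen from out-of-fold predictions on the training data.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_roc, y_roc = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_roc, y_roc, test_size=0.2, random_state=42, stratify=y_roc
)

roc_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=10000)),
])

# Out-of-fold probabilities on the training set (no test data involved)
oof_proba = cross_val_predict(
    roc_pipeline, X_tr, y_tr, cv=5, method='predict_proba'
)[:, 1]

# Youden's J = TPR - FPR; take the threshold that maximizes it
fpr_tr, tpr_tr, thr_tr = roc_curve(y_tr, oof_proba)
youden_j = tpr_tr - fpr_tr
best_idx = int(np.argmax(youden_j))
best_threshold = thr_tr[best_idx]

# Apply the chosen threshold to held-out test probabilities
roc_pipeline.fit(X_tr, y_tr)
test_proba = roc_pipeline.predict_proba(X_te)[:, 1]
y_pred_tuned = (test_proba >= best_threshold).astype(int)

print(f"Chosen threshold: {best_threshold:.3f} (J = {youden_j[best_idx]:.3f})")
print(f"Test accuracy at tuned threshold: {(y_pred_tuned == y_te).mean():.4f}")
```

When false negatives and false positives carry different costs, a cost-weighted rule may be more appropriate than Youden's J.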

### 5️⃣ Feature Importance (Coefficients)

In [None]:
# Get coefficients
coefficients = logreg_pipeline.named_steps['classifier'].coef_[0]
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'coefficient': coefficients,
    'abs_coefficient': np.abs(coefficients)
}).sort_values('abs_coefficient', ascending=False)

print("\nüîç Top 10 Most Important Features:")
print(feature_importance.head(10))

# Visualize coefficients
plt.figure(figsize=(12, 8))
top_features = feature_importance.head(15)
colors = ['red' if x < 0 else 'green' for x in top_features['coefficient']]
plt.barh(range(len(top_features)), top_features['coefficient'], color=colors, edgecolor='black')
plt.yticks(range(len(top_features)), top_features['feature'])
plt.xlabel('Coefficient Value', fontsize=12)
plt.title('Top 15 Feature Coefficients\n(Green: Positive impact, Red: Negative impact)', 
          fontsize=14, fontweight='bold')
plt.axvline(x=0, color='black', linestyle='--', linewidth=1)
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

# Interpret coefficients
print("\nüí° Coefficient Interpretation:")
print("- Positive coefficient: Increases probability of malignant (class 1)")
print("- Negative coefficient: Decreases probability of malignant (class 1)")
print("- Larger absolute value: Stronger influence on prediction")
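
Coefficients live on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to communicate. The sketch below refits the same kind of pipeline on the full dataset purely for illustration. Because the features are standardized, each odds ratio describes the multiplicative change in the odds of class 1 (benign in `load_breast_cancer`) for a one-standard-deviation increase in that feature, holding the others fixed.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()

or_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=10000)),
])
or_pipeline.fit(data.data, data.target)

coef = or_pipeline.named_steps['classifier'].coef_[0]
odds_ratios = np.exp(coef)   # odds ratio per one-standard-deviation increase

# Show the five features that most reduce the odds of class 1 (benign)
order = np.argsort(odds_ratios)
for idx in order[:5]:
    print(f"{data.feature_names[idx]:<25} coef={coef[idx]:+.3f}  odds ratio={odds_ratios[idx]:.3f}")
```

An odds ratio below 1 means the feature pushes predictions toward class 0 (malignant); above 1 pushes toward class 1 (benign). With correlated features, as in this dataset, individual odds ratios should be read with caution.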

### 6️⃣ Probability Calibration

In [None]:
# Analyze prediction probabilities
prob_df = pd.DataFrame({
    'true_label': y_test,
    'predicted_prob': y_pred_proba,
    'predicted_label': y_pred
})

# Visualize probability distribution
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Probability distribution by true class
for class_label in [0, 1]:
    class_probs = prob_df[prob_df['true_label'] == class_label]['predicted_prob']
    axes[0].hist(class_probs, bins=30, alpha=0.6, label=cancer.target_names[class_label], edgecolor='black')

axes[0].set_xlabel('Predicted Probability', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].set_title('Probability Distribution by True Class', fontsize=14, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Probability vs Prediction
axes[1].scatter(range(len(y_test)), y_pred_proba, 
               c=y_test, cmap='RdYlGn', alpha=0.6, edgecolors='black')
axes[1].axhline(y=0.5, color='red', linestyle='--', linewidth=2, label='Decision Threshold (0.5)')
axes[1].set_xlabel('Sample Index', fontsize=12)
axes[1].set_ylabel('Predicted Probability', fontsize=12)
axes[1].set_title('Predicted Probabilities (colored by true class)', fontsize=14, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nüìä Probability Statistics:")
print(f"Mean probability for class 0: {prob_df[prob_df['true_label']==0]['predicted_prob'].mean():.4f}")
print(f"Mean probability for class 1: {prob_df[prob_df['true_label']==1]['predicted_prob'].mean():.4f}")
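
The plots above show how well the probabilities separate the two classes, but calibration asks a slightly different question: among samples predicted at roughly 80%, are about 80% actually positive? The sketch below is one way to check this with `calibration_curve` and the Brier score. The bin count of 5 and the quantile binning strategy are assumptions chosen because the test set is small.

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_cal, y_cal = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_cal, y_cal, test_size=0.2, random_state=42, stratify=y_cal
)

cal_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=10000)),
])
cal_pipeline.fit(X_tr, y_tr)
proba = cal_pipeline.predict_proba(X_te)[:, 1]

# Observed fraction of positives vs. mean predicted probability per bin
prob_true, prob_pred = calibration_curve(y_te, proba, n_bins=5, strategy='quantile')
brier = brier_score_loss(y_te, proba)   # mean squared error of the probabilities

for observed, predicted in zip(prob_true, prob_pred):
    print(f"mean predicted = {predicted:.3f}  observed fraction = {observed:.3f}")
print(f"Brier score: {brier:.4f} (lower is better)")
```

Bins where the observed fraction sits close to the mean predicted probability indicate good calibration; large gaps suggest the probabilities should not be taken at face value.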

### 7️⃣ Regularization (L1 and L2)

In [None]:
# Compare different regularization strengths
C_values = [0.001, 0.01, 0.1, 1, 10, 100]
penalties = ['l1', 'l2']

results = []

for penalty in penalties:
    for C in C_values:
        # Create model
        model = Pipeline([
            ('scaler', StandardScaler()),
            ('classifier', LogisticRegression(
                penalty=penalty, 
                C=C, 
                solver='liblinear' if penalty == 'l1' else 'lbfgs',
                random_state=42,
                max_iter=10000
            ))
        ])
        
        # Train and evaluate
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        
        # Count non-zero coefficients
        coef = model.named_steps['classifier'].coef_[0]
        non_zero = np.sum(np.abs(coef) > 1e-5)
        
        results.append({
            'Penalty': penalty,
            'C': C,
            'Accuracy': accuracy,
            'Non-zero Features': non_zero
        })

results_df = pd.DataFrame(results)
print("\nüìä Regularization Comparison:")
print(results_df)

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

for penalty in penalties:
    data = results_df[results_df['Penalty'] == penalty]
    axes[0].plot(data['C'], data['Accuracy'], marker='o', linewidth=2, label=penalty.upper())
    axes[1].plot(data['C'], data['Non-zero Features'], marker='o', linewidth=2, label=penalty.upper())

axes[0].set_xscale('log')
axes[0].set_xlabel('C (Inverse Regularization Strength)', fontsize=12)
axes[0].set_ylabel('Accuracy', fontsize=12)
axes[0].set_title('Accuracy vs Regularization Strength', fontsize=14, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

axes[1].set_xscale('log')
axes[1].set_xlabel('C (Inverse Regularization Strength)', fontsize=12)
axes[1].set_ylabel('Number of Non-zero Features', fontsize=12)
axes[1].set_title('Feature Selection via Regularization', fontsize=14, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nüí° Regularization Insights:")
print("- L1 (Lasso): Performs feature selection (some coefficients become exactly 0)")
print("- L2 (Ridge): Shrinks coefficients but keeps all features")
print("- Smaller C: Stronger regularization (simpler model)")
print("- Larger C: Weaker regularization (more complex model)")
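
The comparison above scores each setting on the test set, which is fine for illustration but would bias a real model choice. A more careful approach is to select `C` and the penalty by cross-validation on the training data only, for example with `GridSearchCV` (imported at the top of this guide). The grid values and the ROC-AUC scoring below are assumptions for this sketch.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_gs, y_gs = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_gs, y_gs, test_size=0.2, random_state=42, stratify=y_gs
)

gs_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    # liblinear supports both L1 and L2 penalties for binary problems
    ('classifier', LogisticRegression(solver='liblinear', max_iter=10000, random_state=42)),
])

# Pipeline step parameters are addressed as <step name>__<parameter>
param_grid = {
    'classifier__penalty': ['l1', 'l2'],
    'classifier__C': [0.01, 0.1, 1, 10],
}

search = GridSearchCV(gs_pipeline, param_grid, cv=5, scoring='roc_auc')
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated ROC-AUC: {search.best_score_:.4f}")
print(f"Test accuracy of best model:  {search.score(X_te, y_te):.4f}")
```

Note that `search.score` reports the `scoring` metric (ROC-AUC here) on the given data, so the last line is labeled loosely; use `accuracy_score` on `search.predict(X_te)` if accuracy is what you want to report.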

### 8️⃣ Handling Class Imbalance

In [None]:
# Create imbalanced dataset
X_imb, y_imb = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=15,
    n_redundant=5,
    weights=[0.9, 0.1],  # 90% class 0, 10% class 1
    random_state=42
)

print(f"Imbalanced dataset class distribution:")
print(pd.Series(y_imb).value_counts())
print(f"\nClass balance: {pd.Series(y_imb).value_counts(normalize=True)}")

# Split data
X_train_imb, X_test_imb, y_train_imb, y_test_imb = train_test_split(
    X_imb, y_imb, test_size=0.2, random_state=42, stratify=y_imb
)

# Compare different class weight strategies
strategies = [
    ('No Weighting', None),
    ('Balanced', 'balanced'),
    ('Custom {0:1, 1:9}', {0: 1, 1: 9})
]

results_imb = []

for name, class_weight in strategies:
    model = Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', LogisticRegression(
            class_weight=class_weight,
            random_state=42,
            max_iter=10000
        ))
    ])
    
    model.fit(X_train_imb, y_train_imb)
    y_pred = model.predict(X_test_imb)
    
    results_imb.append({
        'Strategy': name,
        'Accuracy': accuracy_score(y_test_imb, y_pred),
        'Precision (Class 1)': precision_score(y_test_imb, y_pred),
        'Recall (Class 1)': recall_score(y_test_imb, y_pred),
        'F1 (Class 1)': f1_score(y_test_imb, y_pred)
    })

results_imb_df = pd.DataFrame(results_imb)
print("\nüìä Class Imbalance Handling Comparison:")
print(results_imb_df.to_string(index=False))

# Visualize
results_imb_df.set_index('Strategy')[['Precision (Class 1)', 'Recall (Class 1)', 'F1 (Class 1)']].plot(
    kind='bar', figsize=(12, 6), edgecolor='black'
)
plt.title('Impact of Class Weighting on Minority Class Performance', fontsize=14, fontweight='bold')
plt.ylabel('Score', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.legend(title='Metric')
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

print("\nüí° Class Imbalance Insights:")
print("- 'balanced': Automatically adjusts weights inversely proportional to class frequencies")
print("- Custom weights: Fine-tune based on business requirements")
print("- Trade-off: Higher recall often means lower precision")
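
To see what `class_weight='balanced'` actually computes, the sketch below reproduces the weights by hand using the formula `n_samples / (n_classes * count_per_class)` and compares them with scikit-learn's `compute_class_weight`. The toy label vector with an exact 90/10 split is made up for this example.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: exactly 900 samples of class 0 and 100 of class 1
y_toy = np.array([0] * 900 + [1] * 100)
classes = np.unique(y_toy)

# Manual formula: n_samples / (n_classes * number of samples in each class)
manual_weights = len(y_toy) / (len(classes) * np.bincount(y_toy))

# Same computation through scikit-learn
sklearn_weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_toy)

print("Manual: ", dict(zip(classes.tolist(), manual_weights.round(4).tolist())))
print("sklearn:", dict(zip(classes.tolist(), sklearn_weights.round(4).tolist())))
```

For an exact 90/10 split the balanced weights are about 0.556 and 5.0, a 1:9 ratio. That is the same ratio as the custom `{0: 1, 1: 9}` weights in the comparison above, so the two strategies differ mainly in overall scale, which interacts with the regularization strength `C`.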

## Part 2: Multi-class Classification
### 9️⃣ Multi-class Logistic Regression

In [None]:
from sklearn.datasets import load_iris

# Load iris dataset (3 classes)
iris = load_iris()
X_multi = pd.DataFrame(iris.data, columns=iris.feature_names)
y_multi = pd.Series(iris.target, name='species')

print(f"Multi-class dataset shape: {X_multi.shape}")
print(f"\nClasses: {iris.target_names}")
print(f"\nClass distribution:")
print(y_multi.value_counts().sort_index())

# Split data
X_train_multi, X_test_multi, y_train_multi, y_test_multi = train_test_split(
    X_multi, y_multi, test_size=0.2, random_state=42, stratify=y_multi
)

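# Train multi-class logistic regression with two strategies.
# Recent scikit-learn releases deprecate the `multi_class` argument, so
# one-vs-rest is built with OneVsRestClassifier, while a plain
# LogisticRegression (lbfgs solver) fits a multinomial (softmax) model.
from sklearn.multiclass import OneVsRestClassifier

multi_strategies = {
    'ovr': OneVsRestClassifier(LogisticRegression(random_state=42, max_iter=10000)),
    'multinomial': LogisticRegression(random_state=42, max_iter=10000)
}

for strategy, classifier in multi_strategies.items():
    print(f"\n{'='*50}")
    print(f"Strategy: {strategy.upper()}")
    print(f"{'='*50}")
    
    model_multi = Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', classifier)
    ])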
    
    model_multi.fit(X_train_multi, y_train_multi)
    y_pred_multi = model_multi.predict(X_test_multi)
    
    accuracy = accuracy_score(y_test_multi, y_pred_multi)
    print(f"\nAccuracy: {accuracy:.4f}")
    print(f"\nClassification Report:")
    print(classification_report(y_test_multi, y_pred_multi, target_names=iris.target_names))

# Confusion matrix for multinomial
cm_multi = confusion_matrix(y_test_multi, y_pred_multi)

plt.figure(figsize=(8, 6))
sns.heatmap(cm_multi, annot=True, fmt='d', cmap='Blues',
            xticklabels=iris.target_names,
            yticklabels=iris.target_names,
            cbar_kws={'label': 'Count'})
plt.xlabel('Predicted Label', fontsize=12)
plt.ylabel('True Label', fontsize=12)
plt.title('Multi-class Confusion Matrix', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
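
The multinomial strategy turns one linear score per class into probabilities with the softmax function, the multi-class counterpart of the sigmoid. The sketch below implements softmax in NumPy; the class scores are hypothetical values chosen for illustration, not outputs of the iris model.

```python
import numpy as np

def softmax(scores):
    """Convert per-class scores (one row per sample) into probabilities."""
    # Subtracting the row maximum avoids overflow and does not change the result
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

# Hypothetical linear scores for 2 samples and 3 classes
scores = np.array([
    [2.0, 1.0, 0.1],   # clearly favors class 0
    [0.5, 0.5, 0.5],   # no preference between classes
])

probs = softmax(scores)
print(probs.round(3))
print("Row sums:", probs.sum(axis=1))
print("Predicted classes:", probs.argmax(axis=1))
```

Each row sums to 1, and the predicted class is simply the one with the largest score. One-vs-rest instead applies a separate sigmoid per class and normalizes afterwards.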

## 📊 Key Takeaways

### Logistic Regression Advantages:
✅ **Interpretable**: Clear coefficient interpretation  
✅ **Probabilistic**: Provides probability estimates  
✅ **Fast**: Quick to train and predict  
✅ **Regularization**: Built-in L1/L2 regularization  
✅ **Multi-class**: Handles multi-class classification  
✅ **Baseline**: Excellent baseline model  

### Best Practices:
1. **Always scale features** - Regularization penalizes every coefficient on the same scale, so unscaled features are shrunk unevenly; scaling also helps the solver converge
2. **Check for multicollinearity** - Highly correlated features affect interpretation
3. **Use regularization** - Prevents overfitting, especially with many features
4. **Handle class imbalance** - Use class_weight parameter
5. **Evaluate with multiple metrics** - Accuracy alone can be misleading

### When to Use:
✅ Need interpretable model  
✅ Probability estimates required  
✅ Linear decision boundary acceptable  
✅ Baseline for comparison  
✅ Feature selection (L1 regularization)  

### When NOT to Use:
❌ Non-linear decision boundaries  
❌ Complex feature interactions  
❌ Very high-dimensional unstructured data such as raw images or audio (consider deep learning)  
❌ When maximum accuracy is critical (try ensemble methods)  

### Regularization Guide:
- **L1 (Lasso)**: Use for feature selection, sparse models
- **L2 (Ridge)**: Use for general regularization, keeps all features
- **C parameter**: Smaller C = stronger regularization

### Class Imbalance:
- Use `class_weight='balanced'` for automatic weighting
- Custom weights for business-specific requirements
- Monitor precision and recall, not just accuracy

### Multi-class Strategies:
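- **OVR (One-vs-Rest)**: Trains one binary classifier per class; simple and often works well in practice
- **Multinomial**: Fits a single softmax model over all classes jointly, so the class probabilities come from one model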

### Next Steps:
1. Try different regularization strengths
2. Experiment with feature engineering
3. Compare with other classifiers
4. Use in ensemble methods
5. Deploy with probability calibration