# Logistic Regression Primer

This is a practical primer on implementing logistic regression with scikit-learn.

**Sections:**
- Section 1: Core Implementation
- Section 2: Data Preprocessing & Feature Scaling
- Section 3: Handling Class Imbalance
- Section 4: Regularization & Hyperparameter Tuning
- Section 5: Feature Engineering

## 1. Core Implementation

### 1.1 Setup & Data Loading

In [4]:
from sklearn.datasets import load_breast_cancer
import numpy as np

# Load a binary classification dataset
data = load_breast_cancer()
X, y = data.data, data.target

print("Dataset info:")
print(f"Features: {X.shape[1]}")
print(f"Samples: {X.shape[0]}")
print(f"Target classes: {data.target_names}")
print(f"Class distribution: {np.bincount(y)}")
print(f"Class balance: {np.bincount(y)[1]/len(y):.1%} benign")  # class 1 is 'benign'

Dataset info:
Features: 30
Samples: 569
Target classes: ['malignant' 'benign']
Class distribution: [212 357]
Class balance: 62.7% benign


### 1.2 Train the Model

In [10]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# stratify=y keeps the class proportions of the original dataset in both the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print("Data split:")
print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")

model = LogisticRegression(random_state=42, max_iter=10000)
model.fit(X_train, y_train)

print("\nModel trained successfully!")
print(f"Model converged in {model.n_iter_[0]} iterations")

Data split:
Training samples: 455
Test samples: 114

Model trained successfully!
Model converged in 2227 iterations


### 1.3 Make Predictions

In [13]:
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)

print("Prediction types:")
print(f"Class predictions shape: {y_pred.shape}")
print(f"Probability predictions shape: {y_prob.shape}")

# Show the first 10 predictions
print("\nFirst 10 predictions")
print(f"{'Actual':9} | {'Predicted':9} | {'Prob(Benign)':>12} | {'Prob(Malignant)':>15}")
print("-" * 54)
for i in range(10):
    actual = "Benign" if y_test[i] == 1 else "Malignant"
    predicted = "Benign" if y_pred[i] == 1 else "Malignant"
    prob_benign = y_prob[i][1]
    prob_malignant = y_prob[i][0]
    print(f"{actual:9} | {predicted:9} | {prob_benign:12.3f} | {prob_malignant:15.3f}")

Prediction types:
Class predictions shape: (114,)
Probability predictions shape: (114, 2)

First 10 predictions
Actual    | Predicted | Prob(Benign) | Prob(Malignant)
------------------------------------------------------
Malignant | Malignant |        0.000 |           1.000
Benign    | Benign    |        1.000 |           0.000
Malignant | Malignant |        0.050 |           0.950
Benign    | Benign    |        0.604 |           0.396
Malignant | Malignant |        0.000 |           1.000
Benign    | Benign    |        0.983 |           0.017
Benign    | Benign    |        1.000 |           0.000
Malignant | Malignant |        0.000 |           1.000
Malignant | Malignant |        0.000 |           1.000
Malignant | Malignant |        0.000 |           1.000
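
The probabilities in the table come from applying the logistic (sigmoid) function to a linear score. The following is a minimal sketch, assuming the same dataset and split as above and a freshly fitted model; the names `clf`, `scores`, and `manual_prob` are illustrative and do not appear in the cells above. It reproduces `predict_proba` by hand for the binary case.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = LogisticRegression(max_iter=10000).fit(X_tr, y_tr)

# Linear score w . x + b (the same quantity as clf.decision_function)
scores = X_te @ clf.coef_[0] + clf.intercept_[0]

# The sigmoid maps the score to P(class 1); class 1 is "benign" here
manual_prob = 1.0 / (1.0 + np.exp(-scores))

# Compare the hand-computed probabilities with scikit-learn's
matches = np.allclose(manual_prob, clf.predict_proba(X_te)[:, 1])
print(f"Manual sigmoid matches predict_proba: {matches}")
```

`predict` then amounts to thresholding this probability at 0.5, which is the same as checking whether the linear score is positive.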


### 1.4 Evaluate Results

In [17]:
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd

accuracy = accuracy_score(y_test, y_pred)
print("Model Performance:")
print(f"Accuracy: {accuracy:.3f} ({accuracy*100:.1f}%)")

print("\nDetailed Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))

feature_names = data.feature_names
coefficients = model.coef_[0]

feature_importance = pd.DataFrame({
    'feature': feature_names,
    'coefficient': coefficients
})
feature_importance['abs_coef'] = abs(feature_importance['coefficient'])
top_features = feature_importance.nlargest(5, 'abs_coef')

# Class 1 is 'benign', so a positive coefficient raises the benign probability.
# The features are unscaled here, so magnitudes are not directly comparable across features.
print("\nTop 5 Most Important Features:")
for _, row in top_features.iterrows():
    direction = "increases" if row['coefficient'] > 0 else "decreases"
    print(f"{row['feature']}: {row['coefficient']:.3f} {direction} benign probability")

Model Performance:
Accuracy: 0.965 (96.5%)

Detailed Report:
              precision    recall  f1-score   support

   malignant       0.97      0.93      0.95        42
      benign       0.96      0.99      0.97        72

    accuracy                           0.96       114
   macro avg       0.97      0.96      0.96       114
weighted avg       0.97      0.96      0.96       114


Top 5 Most Important Features:
worst concavity: -1.316 decreases benign probability
texture error: 1.092 increases benign probability
mean radius: 0.806 increases benign probability
worst symmetry: -0.782 decreases benign probability
worst compactness: -0.754 decreases benign probability
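
Coefficients are easier to read once the features share a scale, because each one then describes the effect of a one-standard-deviation change. The sketch below is an illustration under stated assumptions rather than part of the analysis above: it refits on the full dataset with a `StandardScaler` inside a pipeline and converts each coefficient to an odds ratio with `exp`. The names `pipe`, `odds`, and `top` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()

# Assumption: fit on all rows, since this is for interpretation only
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(data.data, data.target)

coefs = pipe.named_steps["logisticregression"].coef_[0]
odds = pd.DataFrame({
    "feature": data.feature_names,
    "coefficient": coefs,
    # exp(coef) is the multiplicative change in the odds of class 1 (benign)
    # for a one-standard-deviation increase in the feature
    "odds_ratio": np.exp(coefs),
})

# Rank by absolute coefficient, largest first
top = odds.sort_values("coefficient", key=lambda s: s.abs(), ascending=False).head()
print(top.to_string(index=False))
```

An odds ratio below 1 means the feature pushes predictions toward malignant (class 0); a ratio above 1 pushes toward benign.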


## 2. Data Preprocessing & Feature Scaling

### 2.1 Why Scaling Matters - Demonstrating the Problem

In [24]:
from sklearn.datasets import load_wine
import pandas as pd
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore') # To hide convergence warning

data = load_wine()
X, y = data.data, data.target

y_binary = (y == 0).astype(int)  # Binary target: class 0 vs. the rest

feature_scales = pd.DataFrame({
    'feature': data.feature_names,
    'min': X.min(axis=0),
    'max': X.max(axis=0),
    'range': X.max(axis=0) - X.min(axis=0)
}).round(2)

print("Feature Scale Analysis:")
print("Features with HUGE scale differences:")
print(feature_scales.nlargest(5, 'range')[['feature', 'min', 'max', 'range']])
print("\nFeatures with small scales:")
print(feature_scales.nsmallest(5, 'range')[['feature', 'min', 'max', 'range']])

X_train, X_test, y_train, y_test = train_test_split(
    X, y_binary, test_size=0.2, random_state=42, stratify=y_binary
)

Feature Scale Analysis:
Features with HUGE scale differences:
              feature     min     max    range
12            proline  278.00  1680.0  1402.00
4           magnesium   70.00   162.0    92.00
3   alcalinity_of_ash   10.60    30.0    19.40
9     color_intensity    1.28    13.0    11.72
1          malic_acid    0.74     5.8     5.06

Features with small scales:
                         feature   min   max  range
7           nonflavanoid_phenols  0.13  0.66   0.53
10                           hue  0.48  1.71   1.23
2                            ash  1.36  3.23   1.87
11  od280/od315_of_diluted_wines  1.27  4.00   2.73
5                  total_phenols  0.98  3.88   2.90


### 2.2 Training Without Scaling - See the Problem

In [40]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

print("Training WITHOUT scaling:")
model_unscaled = LogisticRegression(random_state=42, max_iter=100)
model_unscaled.fit(X_train, y_train)

# Checking for convergence
print(f"Converged: {model_unscaled.n_iter_[0] < 100}")
print(f"Iterations used: {model_unscaled.n_iter_[0]}/100")

# Checking performance
y_pred_unscaled = model_unscaled.predict(X_test)
accuracy_unscaled = accuracy_score(y_test, y_pred_unscaled)
print(f"Accuracy without scaling: {accuracy_unscaled:.3f}")

# Checking the coefficients
coef_unscaled = model_unscaled.coef_[0]
print(f"\nCoefficient range: {coef_unscaled.min():.6f} to {coef_unscaled.max():.6f}")
print("Problem: no convergence within max_iter, and raw-unit coefficients are not comparable across features!")

Training WITHOUT scaling:
Converged: False
Iterations used: 100/100
Accuracy without scaling: 1.000

Coefficient range: -0.570768 to 1.222616
Problem: no convergence within max_iter, and raw-unit coefficients are not comparable across features!


### 2.3 Training With StandardScaler - The Solution

In [41]:
from sklearn.preprocessing import StandardScaler

print("Training WITH StandardScaler:")

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Reuse the scaler fitted on the training data to avoid data leakage

print("After scaling - feature statistics:")
print(f"Mean: {X_train_scaled.mean(axis=0)[:3]}")  # Should be ~0
print(f"Std:  {X_train_scaled.std(axis=0)[:3]}")   # Should be ~1

# Train model on scaled data
model_scaled = LogisticRegression(random_state=42, max_iter=100)  # Same low max_iter
model_scaled.fit(X_train_scaled, y_train)

print(f"\nConverged: {model_scaled.n_iter_[0] < 100}")
print(f"Iterations used: {model_scaled.n_iter_[0]}/100")

# Performance
y_pred_scaled = model_scaled.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)
print(f"Accuracy with scaling: {accuracy_scaled:.3f}")

# Coefficients are now comparable
coef_scaled = model_scaled.coef_[0]
print(f"\nCoefficient range: {coef_scaled.min():.3f} to {coef_scaled.max():.3f}")
print("Much better! Coefficients are now comparable in magnitude")

Training WITH StandardScaler:
After scaling - feature statistics:
Mean: [4.35957999e-15 1.10475009e-15 2.02576610e-15]
Std:  [1. 1. 1.]

Converged: True
Iterations used: 12/100
Accuracy with scaling: 0.972

Coefficient range: -1.096 to 1.784
Much better! Coefficients are now comparable in magnitude
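
Under the hood, `StandardScaler` subtracts the training mean and divides by the training standard deviation. The sketch below checks this by hand on made-up data; the arrays and the names `mu`, `sigma`, and `manual_test` are hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: two features on very different scales
X_train = rng.normal(loc=[10.0, 1000.0], scale=[2.0, 300.0], size=(100, 2))
X_test = rng.normal(loc=[10.0, 1000.0], scale=[2.0, 300.0], size=(20, 2))

scaler = StandardScaler().fit(X_train)

# z = (x - mean) / std, with both statistics taken from the training set only
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)  # population std (ddof=0), as StandardScaler uses
manual_test = (X_test - mu) / sigma

same = np.allclose(manual_test, scaler.transform(X_test))
print(f"Manual standardization matches StandardScaler: {same}")
```

Because the test set is transformed with training statistics, its columns will be close to, but not exactly, mean 0 and standard deviation 1.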


### 2.4 Impact Comparison

In [49]:
# Direct comparison
print("=== SCALING IMPACT COMPARISON ===")
print(f"{'Metric':<20} {'Without Scaling':<15} {'With Scaling':<15} {'Improvement'}")
print("-" * 65)

metrics = {
    'Convergence': [model_unscaled.n_iter_[0] < 100, model_scaled.n_iter_[0] < 100],
    'Iterations': [model_unscaled.n_iter_[0], model_scaled.n_iter_[0]],
    'Accuracy': [accuracy_unscaled, accuracy_scaled],
    'Coef Range': [coef_unscaled.max() - coef_unscaled.min(), 
                   coef_scaled.max() - coef_scaled.min()]
}

for metric, values in metrics.items():
    if metric == 'Convergence':
        print(f"{metric:<20} {str(values[0]):<15} {str(values[1]):<15} {'✓' if values[1] else '✗'}")
    elif metric == 'Iterations':
        improvement = f"{values[0] - values[1]:+d}"
        print(f"{metric:<20} {values[0]:<15} {values[1]:<15} {improvement}")
    else:
        improvement = f"{values[1] - values[0]:+.3f}"
        print(f"{metric:<20} {values[0]:<15.3f} {values[1]:<15.3f} {improvement}")

=== SCALING IMPACT COMPARISON ===
Metric               Without Scaling With Scaling    Improvement
-----------------------------------------------------------------
Convergence          False           True            ✓
Iterations           100             12              +88
Accuracy             1.000           0.972           -0.028
Coef Range           1.793           2.880           +1.087
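
The accuracy gap in this table (1.000 vs 0.972) corresponds to a single test sample out of 36, so one split says little about which model is better. A more robust comparison is cross-validation with the scaler inside a pipeline, so it is refit on each training fold. The sketch below assumes the same wine data and binary target as in 2.1; `cv_means` and the raised `max_iter` are choices made for this illustration.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
y_binary = (y == 0).astype(int)  # class 0 vs. the rest, as in 2.1

candidates = {
    "unscaled": LogisticRegression(max_iter=5000),
    "scaled": make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000)),
}

cv_means = {}
for name, estimator in candidates.items():
    # With a classifier, cv=5 uses stratified folds;
    # the pipeline refits the scaler on each training fold
    scores = cross_val_score(estimator, X, y_binary, cv=5, scoring="accuracy")
    cv_means[name] = scores.mean()
    print(f"{name:<9} accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The main benefits of scaling shown in this section are faster, more reliable convergence and coefficients on a common scale, not necessarily higher accuracy.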


## 3. Handling Class Imbalance

### 3.1 Detecting Imbalance - The Problem

In [53]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate imbalanced data: 5% fraud, 95% normal transactions
X, y = make_classification(
    n_samples=10000,
    n_features=10,
    n_informative=5,
    n_redundant=3,
    weights=[0.95, 0.05],  # 95% class 0, 5% class 1
    flip_y=0.01,           # Add some noise
    random_state=42
)

print("Class Imbalance Analysis:")
unique, counts = np.unique(y, return_counts=True)
for class_label, count in zip(unique, counts):
    percentage = count / len(y) * 100
    print(f"Class {class_label}: {count:,} samples ({percentage:.1f}%)")

class_names = ['Normal', 'Fraud']
print(f"\nDataset: {X.shape[0]:,} transactions")
print(f"Fraud rate: {np.mean(y):.1%} (highly imbalanced!)")

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

Class Imbalance Analysis:
Class 0: 9,456 samples (94.6%)
Class 1: 544 samples (5.4%)

Dataset: 10,000 transactions
Fraud rate: 5.4% (highly imbalanced!)


### 3.2 Why Accuracy Fails - Naive Model

In [56]:
from sklearn.preprocessing import StandardScaler

# Scale features (recommended: helps the solver converge and keeps the penalty comparable across features)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train basic model (no class balancing)
model_basic = LogisticRegression(random_state=42)
model_basic.fit(X_train_scaled, y_train)

# Evaluate with accuracy only
y_pred_basic = model_basic.predict(X_test_scaled)
accuracy_basic = accuracy_score(y_test, y_pred_basic)

print("=== BASIC MODEL (No Class Balancing) ===")
print(f"Accuracy: {accuracy_basic:.3f} ({accuracy_basic*100:.1f}%)")
print("\nDetailed Results:")
print(classification_report(y_test, y_pred_basic, target_names=class_names))

# Show the problem with accuracy
fraud_detected = np.sum((y_test == 1) & (y_pred_basic == 1))
total_fraud = np.sum(y_test == 1)
print(f"\nTHE PROBLEM:")
print(f"Fraud cases detected: {fraud_detected}/{total_fraud}")
print(f"Fraud detection rate: {fraud_detected/total_fraud:.1%}")
print(f"Model might just predict 'Normal' for everything and still get 95% accuracy!")

=== BASIC MODEL (No Class Balancing) ===
Accuracy: 0.949 (94.9%)

Detailed Results:
              precision    recall  f1-score   support

      Normal       0.95      1.00      0.97      1891
       Fraud       0.73      0.10      0.18       109

    accuracy                           0.95      2000
   macro avg       0.84      0.55      0.58      2000
weighted avg       0.94      0.95      0.93      2000


THE PROBLEM:
Fraud cases detected: 11/109
Fraud detection rate: 10.1%
Model might just predict 'Normal' for everything and still get 95% accuracy!


### 3.3 Better Metrics - Understanding What Matters

In [59]:
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

# Calculate better metrics for imbalanced data
y_prob_basic = model_basic.predict_proba(X_test_scaled)[:, 1]

metrics_basic = {
    'Accuracy': accuracy_score(y_test, y_pred_basic),
    'Precision': precision_score(y_test, y_pred_basic),
    'Recall': recall_score(y_test, y_pred_basic),
    'F1-Score': f1_score(y_test, y_pred_basic),
    'ROC-AUC': roc_auc_score(y_test, y_prob_basic)
}

print("=== UNDERSTANDING METRICS FOR IMBALANCED DATA ===")
for metric, value in metrics_basic.items():
    print(f"{metric}: {value:.3f}")

print(f"\n📚 METRIC EXPLANATIONS:")
print(f"• Accuracy: Overall correct predictions (misleading with imbalance)")
print(f"• Precision: Of predicted fraud, how many were actually fraud?")
print(f"• Recall: Of actual fraud, how many did we catch?")
print(f"• F1-Score: Harmonic mean of precision and recall")
print(f"• ROC-AUC: How well can model distinguish between classes?")

# Show confusion matrix breakdown
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred_basic)
tn, fp, fn, tp = cm.ravel()

print(f"\n📊 CONFUSION MATRIX BREAKDOWN:")
print(f"True Negatives (Normal correctly): {tn}")
print(f"False Positives (Normal as Fraud): {fp}")  
print(f"False Negatives (Fraud as Normal): {fn} ⚠️ BAD!")
print(f"True Positives (Fraud correctly): {tp}")

=== UNDERSTANDING METRICS FOR IMBALANCED DATA ===
Accuracy: 0.949
Precision: 0.733
Recall: 0.101
F1-Score: 0.177
ROC-AUC: 0.810

📚 METRIC EXPLANATIONS:
• Accuracy: Overall correct predictions (misleading with imbalance)
• Precision: Of predicted fraud, how many were actually fraud?
• Recall: Of actual fraud, how many did we catch?
• F1-Score: Harmonic mean of precision and recall
• ROC-AUC: How well can model distinguish between classes?

📊 CONFUSION MATRIX BREAKDOWN:
True Negatives (Normal correctly): 1887
False Positives (Normal as Fraud): 4
False Negatives (Fraud as Normal): 98 ⚠️ BAD!
True Positives (Fraud correctly): 11
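
Each metric above can be recomputed directly from the four confusion-matrix counts, which makes the definitions concrete. A small sketch using the counts printed above (the variable names are illustrative):

```python
# Counts taken from the confusion matrix printed above
tn, fp, fn, tp = 1887, 4, 98, 11
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of predicted fraud, the share that is fraud
recall = tp / (tp + fn)      # of actual fraud, the share that was caught
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a model that labels every transaction "Normal"
baseline_accuracy = (tn + fp) / total

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1-Score:  {f1:.3f}")
print(f"All-'Normal' baseline accuracy: {baseline_accuracy:.4f}")
```

The first four values match the scikit-learn results above (0.949, 0.733, 0.101, 0.177). The baseline shows why accuracy alone is misleading here: predicting "Normal" for every case scores almost the same accuracy while catching no fraud.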


### 3.4 Class Weights - The Solution

In [60]:
# Train model with balanced class weights
model_balanced = LogisticRegression(
    class_weight='balanced',  # Weights each class inversely to its frequency
    random_state=42
)
model_balanced.fit(X_train_scaled, y_train)

# Compare results
y_pred_balanced = model_balanced.predict(X_test_scaled)
y_prob_balanced = model_balanced.predict_proba(X_test_scaled)[:, 1]

metrics_balanced = {
    'Accuracy': accuracy_score(y_test, y_pred_balanced),
    'Precision': precision_score(y_test, y_pred_balanced),
    'Recall': recall_score(y_test, y_pred_balanced),
    'F1-Score': f1_score(y_test, y_pred_balanced),
    'ROC-AUC': roc_auc_score(y_test, y_prob_balanced)
}

print("=== COMPARISON: BASIC vs BALANCED ===")
print(f"{'Metric':<12} {'Basic':<8} {'Balanced':<10} {'Change'}")
print("-" * 40)

for metric in metrics_basic.keys():
    basic_val = metrics_basic[metric]
    balanced_val = metrics_balanced[metric]
    change = balanced_val - basic_val
    change_str = f"{change:+.3f}"
    print(f"{metric:<12} {basic_val:<8.3f} {balanced_val:<10.3f} {change_str}")

# Show what class_weight='balanced' does
from sklearn.utils.class_weight import compute_class_weight
class_weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
print(f"\n⚖️ CLASS WEIGHTS EXPLANATION:")
print(f"Normal class weight: {class_weights[0]:.3f}")
print(f"Fraud class weight: {class_weights[1]:.3f}")
print(f"Fraud gets {class_weights[1]/class_weights[0]:.1f}x more weight in training")

=== COMPARISON: BASIC vs BALANCED ===
Metric       Basic    Balanced   Change
----------------------------------------
Accuracy     0.949    0.761      -0.188
Precision    0.733    0.157      -0.577
Recall       0.101    0.771      +0.670
F1-Score     0.177    0.260      +0.083
ROC-AUC      0.810    0.830      +0.020

⚖️ CLASS WEIGHTS EXPLANATION:
Normal class weight: 0.529
Fraud class weight: 9.195
Fraud gets 17.4x more weight in training
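
The `'balanced'` weights follow the formula `n_samples / (n_classes * class_count)`. The sketch below checks that formula against `compute_class_weight`, using a hypothetical label array `y_demo` whose class counts (7,565 and 435) were chosen to be consistent with the weights printed above.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 7,565 normal and 435 fraud (assumed training counts)
y_demo = np.array([0] * 7565 + [1] * 435)

counts = np.bincount(y_demo)
# balanced weight = n_samples / (n_classes * count of that class)
manual = len(y_demo) / (len(counts) * counts)

auto = compute_class_weight("balanced", classes=np.unique(y_demo), y=y_demo)

print(f"Manual weights:  {manual.round(3)}")
print(f"sklearn weights: {auto.round(3)}")
print(f"Ratio fraud/normal: {manual[1] / manual[0]:.1f}x")
```

The ratio of the two weights equals the ratio of the class counts, so the rarer the positive class, the more each of its samples counts in the loss.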


### 3.5 Custom Threshold Tuning - Business Optimization

In [61]:
# Sometimes you need custom thresholds for business needs
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt

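# Find optimal threshold for different business objectives.
# Note: thresholds are tuned on the test set here for illustration only;
# in practice, tune them on a separate validation set.
precision, recall, thresholds = precision_recall_curve(y_test, y_prob_balanced)

# Calculate F1 scores for all thresholds.
# precision and recall have one more entry than thresholds, so drop the last point.
f1_scores = 2 * (precision * recall) / (precision + recall + 1e-8)
optimal_threshold_f1 = thresholds[np.argmax(f1_scores[:-1])]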

print("=== THRESHOLD OPTIMIZATION ===")
print(f"Default threshold: 0.5")
print(f"Optimal F1 threshold: {optimal_threshold_f1:.3f}")

# Test different business scenarios
scenarios = {
    'Conservative (High Precision)': 0.8,  # Only flag if very confident
    'Aggressive (High Recall)': 0.3,      # Flag more cases to catch fraud
    'Balanced F1': optimal_threshold_f1    # Optimal F1 score
}

print(f"\n🎯 BUSINESS SCENARIO COMPARISON:")
print(f"{'Scenario':<30} {'Threshold':<11} {'Precision':<10} {'Recall':<8} {'F1'}")
print("-" * 70)

for scenario, threshold in scenarios.items():
    y_pred_custom = (y_prob_balanced >= threshold).astype(int)
    prec = precision_score(y_test, y_pred_custom)
    rec = recall_score(y_test, y_pred_custom)
    f1 = f1_score(y_test, y_pred_custom)

    print(f"{scenario:<30} {threshold:<11.3f} {prec:<10.3f} {rec:<8.3f} {f1:.3f}")

print(f"\n💡 BUSINESS INTERPRETATION:")
print(f"• Conservative: Minimize false alarms, might miss some fraud")
print(f"• Aggressive: Catch most fraud, but more false alarms")
print(f"• Balanced: Best overall trade-off between precision and recall")

=== THRESHOLD OPTIMIZATION ===
Default threshold: 0.5
Optimal F1 threshold: 0.826

🎯 BUSINESS SCENARIO COMPARISON:
Scenario                       Threshold   Precision  Recall   F1
----------------------------------------------------------------------
Conservative (High Precision)  0.800       0.327      0.339    0.333
Aggressive (High Recall)       0.300       0.095      0.890    0.172
Balanced F1                    0.826       0.391      0.312    0.347

💡 BUSINESS INTERPRETATION:
• Conservative: Minimize false alarms, might miss some fraud
• Aggressive: Catch most fraud, but more false alarms
• Balanced: Best overall trade-off between precision and recall
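
Choosing a threshold on the same data used to report results gives an optimistic estimate. One common remedy is a three-way split: fit on the training set, pick the threshold on a validation set, and report once on the test set. The sketch below assumes the same synthetic data as in 3.1; the 60/20/20 split, the omission of scaling for brevity, and names such as `best_threshold` are choices made for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=10000, n_features=10, n_informative=5, n_redundant=3,
    weights=[0.95, 0.05], flip_y=0.01, random_state=42,
)

# Three-way split: 60% train, 20% validation, 20% test
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=42, stratify=y_tmp
)

clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)

# Choose the F1-optimal threshold using the validation set only
val_prob = clf.predict_proba(X_val)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_val, val_prob)
f1_val = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-8)
best_threshold = thresholds[np.argmax(f1_val)]

# Report once on the untouched test set
y_pred = (clf.predict_proba(X_test)[:, 1] >= best_threshold).astype(int)
test_f1 = f1_score(y_test, y_pred)
print(f"Threshold chosen on validation: {best_threshold:.3f}")
print(f"Test F1 at that threshold: {test_f1:.3f}")
```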


## 4. Regularization & Hyperparameter Tuning

### 4.1 L1 vs L2 Regularization - Understanding the Difference

In [5]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Dataset with many features (some irrelevant)
X, y = make_classification(
    n_samples=1000,
    n_features=50,        # 50 features total
    n_informative=10,     # Only 10 are actually useful
    n_redundant=5,        # 5 are redundant
    n_clusters_per_class=1,
    random_state=42
)

print("High-Dimensional Dataset:")
print(f"Total features: {X.shape[1]}")
print("Informative features: 10")
print("Redundant features: 5")
print(f"Irrelevant (noise) features: {50 - 10 - 5}")

# Split and scale
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

High-Dimensional Dataset:
Total features: 50
Informative features: 10
Redundant features: 5
Irrelevant (noise) features: 35


### 4.2 Comparing Regularization Approaches

In [9]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
import numpy as np

# Compare different regularization approaches
models = {
    'No Regularization': LogisticRegression(C=1e6, random_state=42, max_iter=2000),  # Very high C = almost no regularization
    'L1 (Lasso)': LogisticRegression(penalty='l1', C=0.1, solver='liblinear', random_state=42),
    'L2 (Ridge)': LogisticRegression(penalty='l2', C=0.1, random_state=42, max_iter=2000)
}

results = {}
print("=== REGULARIZATION COMPARISON ===")
print(f"{'Model':<18} {'Train Acc':<10} {'Test Acc':<9} {'ROC-AUC':<8} {'Non-Zero Coef'}")
print("-" * 60)

for name, model in models.items():
    # Train model
    model.fit(X_train_scaled, y_train)
    
    # Evaluate
    train_acc = accuracy_score(y_train, model.predict(X_train_scaled))
    test_acc = accuracy_score(y_test, model.predict(X_test_scaled))
    test_auc = roc_auc_score(y_test, model.predict_proba(X_test_scaled)[:, 1])
    
    # Count non-zero coefficients (feature selection)
    non_zero_coef = np.sum(np.abs(model.coef_[0]) > 0.001)
    
    results[name] = {
        'train_acc': train_acc,
        'test_acc': test_acc,
        'auc': test_auc,
        'non_zero_coef': non_zero_coef,
        'coefficients': model.coef_[0]
    }
    
    print(f"{name:<18} {train_acc:<10.3f} {test_acc:<9.3f} {test_auc:<8.3f} {non_zero_coef}")

# Show overfitting
print(f"\n🎯 OVERFITTING ANALYSIS:")
for name, result in results.items():
    overfitting = result['train_acc'] - result['test_acc']
    print(f"{name}: {overfitting:.3f} (lower is better)")

=== REGULARIZATION COMPARISON ===
Model              Train Acc  Test Acc  ROC-AUC  Non-Zero Coef
------------------------------------------------------------
No Regularization  0.966      0.950     0.995    50
L1 (Lasso)         0.963      0.965     0.997    9
L2 (Ridge)         0.965      0.965     0.997    50

🎯 OVERFITTING ANALYSIS:
No Regularization: 0.016 (lower is better)
L1 (Lasso): -0.002 (lower is better)
L2 (Ridge): 0.000 (lower is better)
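
The "Non-Zero Coef" column shows the key difference: L1 drives many coefficients exactly to zero, while L2 only shrinks them. The sketch below makes this visible across a few `C` values, assuming the same synthetic dataset as in 4.1. For brevity it scales the full dataset without a train/test split, and the `nonzero_counts` dictionary is a name introduced for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=1000, n_features=50, n_informative=10, n_redundant=5,
    n_clusters_per_class=1, random_state=42,
)
X_scaled = StandardScaler().fit_transform(X)

nonzero_counts = {}
for C in [0.01, 0.1, 1.0]:
    for penalty in ["l1", "l2"]:
        clf = LogisticRegression(
            penalty=penalty, C=C, solver="liblinear", random_state=42
        )
        clf.fit(X_scaled, y)
        # Count coefficients that are exactly zero-free (i.e. kept by the model)
        nonzero_counts[(penalty, C)] = int(np.sum(clf.coef_[0] != 0))

for (penalty, C), n in nonzero_counts.items():
    print(f"penalty={penalty}  C={C:<5} non-zero coefficients: {n}/50")
```

Smaller `C` means a stronger penalty, so the L1 model typically keeps fewer features as `C` decreases.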


### 4.3 Understanding the C Parameter

In [10]:
# Test different C values to understand regularization strength
C_values = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]

print("=== C PARAMETER IMPACT (L2 Regularization) ===")
print(f"{'C Value':<8} {'Train Acc':<10} {'Test Acc':<9} {'Overfitting':<11} {'Non-Zero Coef'}")
print("-" * 55)

c_results = {}
for C in C_values:
    model = LogisticRegression(penalty='l2', C=C, random_state=42, max_iter=2000)
    model.fit(X_train_scaled, y_train)
    
    train_acc = accuracy_score(y_train, model.predict(X_train_scaled))
    test_acc = accuracy_score(y_test, model.predict(X_test_scaled))
    overfitting = train_acc - test_acc
    non_zero_coef = np.sum(np.abs(model.coef_[0]) > 0.001)
    
    c_results[C] = {'train': train_acc, 'test': test_acc, 'overfitting': overfitting}
    
    print(f"{C:<8} {train_acc:<10.3f} {test_acc:<9.3f} {overfitting:<11.3f} {non_zero_coef}")

print(f"\n💡 C PARAMETER INTERPRETATION:")
print(f"• High C (100): Less regularization → More complex model → Potential overfitting")
print(f"• Low C (0.001): More regularization → Simpler model → Potential underfitting")
print(f"• Sweet spot: Usually between 0.1 and 10.0")

# Pick the C with the best test accuracy from this quick sweep.
# (A small train-test gap alone would favor the most underfit model;
# section 4.4 selects C properly with cross-validation.)
best_c = max(c_results.keys(), key=lambda x: c_results[x]['test'])
print(f"• Best C from our test: {best_c} (highest test accuracy)")

=== C PARAMETER IMPACT (L2 Regularization) ===
C Value  Train Acc  Test Acc  Overfitting Non-Zero Coef
-------------------------------------------------------
0.001    0.949      0.950     -0.001      47
0.01     0.960      0.960     0.000       50
0.1      0.965      0.965     0.000       50
1.0      0.963      0.955     0.008       50
10.0     0.964      0.950     0.014       50
100.0    0.966      0.950     0.016       50

💡 C PARAMETER INTERPRETATION:
• High C (100): Less regularization → More complex model → Potential overfitting
• Low C (0.001): More regularization → Simpler model → Potential underfitting
• Sweet spot: Usually between 0.1 and 10.0
• Best C from our test: 0.1 (highest test accuracy)


### 4.4 GridSearchCV - Systematic Hyperparameter Tuning

In [11]:
from sklearn.model_selection import GridSearchCV, cross_val_score

# Define parameter grid to search
param_grid = {
    'C': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear']  # Works with both L1 and L2
}

print("=== GRIDSEARCHCV FOR OPTIMAL HYPERPARAMETERS ===")
print("Searching through parameter combinations...")

# Create GridSearchCV
grid_search = GridSearchCV(
    LogisticRegression(random_state=42, max_iter=2000),
    param_grid,
    cv=5,                    # 5-fold cross-validation
    scoring='roc_auc',       # Threshold-independent ranking metric
    n_jobs=-1,               # Use all CPU cores
    verbose=0
)

# Fit grid search
grid_search.fit(X_train_scaled, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation AUC: {grid_search.best_score_:.3f}")

# Evaluate best model on test set
best_model = grid_search.best_estimator_
test_auc = roc_auc_score(y_test, best_model.predict_proba(X_test_scaled)[:, 1])
test_acc = accuracy_score(y_test, best_model.predict(X_test_scaled))

print(f"Test set performance:")
print(f"• Test AUC: {test_auc:.3f}")
print(f"• Test Accuracy: {test_acc:.3f}")

# Show feature selection with best model
if grid_search.best_params_['penalty'] == 'l1':
    selected_features = np.sum(np.abs(best_model.coef_[0]) > 0.001)
    print(f"• Features selected by L1: {selected_features}/{X.shape[1]}")

=== GRIDSEARCHCV FOR OPTIMAL HYPERPARAMETERS ===
Searching through parameter combinations...
Best parameters: {'C': 0.1, 'penalty': 'l1', 'solver': 'liblinear'}
Best cross-validation AUC: 0.990
Test set performance:
• Test AUC: 0.997
• Test Accuracy: 0.965
• Features selected by L1: 9/50
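
The grid search above receives data that was already scaled on the full training set, so each validation fold has influenced the scaler. Putting the scaler inside a `Pipeline` avoids this, because it is then refit on the training part of every fold. This is a sketch under the same synthetic data as in 4.1; the step names `scaler` and `clf` and the reduced grid are choices made for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=1000, n_features=50, n_informative=10, n_redundant=5,
    n_clusters_per_class=1, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(solver="liblinear", random_state=42)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {
    "clf__C": [0.01, 0.1, 1.0, 10.0],
    "clf__penalty": ["l1", "l2"],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc")
search.fit(X_train, y_train)  # raw features: scaling happens inside each fold

test_auc = search.score(X_test, y_test)  # uses the same roc_auc scorer
print(f"Best parameters: {search.best_params_}")
print(f"Best CV AUC: {search.best_score_:.3f}")
print(f"Test AUC: {test_auc:.3f}")
```

The fitted `search` object can be used directly for prediction, and it applies the scaler learned on the training data automatically.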


### 4.5 Cross-Validation - Reliable Model Evaluation

In [12]:
from sklearn.model_selection import cross_validate

# Demonstrate proper cross-validation
print("=== CROSS-VALIDATION FOR RELIABLE EVALUATION ===")

# Compare our best model with baseline
models_to_compare = {
    'Baseline (no reg)': LogisticRegression(C=1e6, random_state=42, max_iter=2000),
    'Best from Grid': grid_search.best_estimator_
}

cv_scores = {}
for name, model in models_to_compare.items():
    # Multiple metrics with cross-validation
    scores = cross_validate(
        model, X_train_scaled, y_train,
        cv=5,
        scoring=['accuracy', 'roc_auc', 'precision', 'recall'],
        return_train_score=True
    )
    
    cv_scores[name] = scores
    
    print(f"\n{name}:")
    print(f"• CV Accuracy: {scores['test_accuracy'].mean():.3f} ± {scores['test_accuracy'].std():.3f}")
    print(f"• CV ROC-AUC: {scores['test_roc_auc'].mean():.3f} ± {scores['test_roc_auc'].std():.3f}")
    print(f"• CV Precision: {scores['test_precision'].mean():.3f} ± {scores['test_precision'].std():.3f}")
    print(f"• CV Recall: {scores['test_recall'].mean():.3f} ± {scores['test_recall'].std():.3f}")
    
    # Check for overfitting
    train_auc = scores['train_roc_auc'].mean()
    test_auc = scores['test_roc_auc'].mean()
    overfitting = train_auc - test_auc
    print(f"• Overfitting (Train-CV): {overfitting:.3f}")

print(f"\n📊 WHY CROSS-VALIDATION MATTERS:")
print(f"• Single train/test split can be lucky or unlucky")
print(f"• CV uses multiple splits for reliable estimates")
print(f"• Standard deviation shows consistency across folds")
print(f"• Compare train vs CV scores to detect overfitting")

# Final model selection
final_model = grid_search.best_estimator_
print(f"\n🏆 FINAL MODEL SELECTED:")
print(f"• Algorithm: LogisticRegression")
print(f"• Penalty: {grid_search.best_params_['penalty']}")
print(f"• C: {grid_search.best_params_['C']}")
print(f"• Expected AUC: {grid_search.best_score_:.3f}")

=== CROSS-VALIDATION FOR RELIABLE EVALUATION ===

Baseline (no reg):
• CV Accuracy: 0.941 ± 0.017
• CV ROC-AUC: 0.986 ± 0.008
• CV Precision: 0.939 ± 0.028
• CV Recall: 0.945 ± 0.010
• Overfitting (Train-CV): 0.010

Best from Grid:
• CV Accuracy: 0.963 ± 0.017
• CV ROC-AUC: 0.990 ± 0.007
• CV Precision: 0.956 ± 0.026
• CV Recall: 0.970 ± 0.013
• Overfitting (Train-CV): 0.001

📊 WHY CROSS-VALIDATION MATTERS:
• Single train/test split can be lucky or unlucky
• CV uses multiple splits for reliable estimates
• Standard deviation shows consistency across folds
• Compare train vs CV scores to detect overfitting

🏆 FINAL MODEL SELECTED:
• Algorithm: LogisticRegression
• Penalty: l1
• C: 0.1
• Expected AUC: 0.990


## 5. Feature Engineering

### 5.1 Categorical Encoding - Handling Non-Numeric Data

In [14]:
import numpy as np
import pandas as pd

# Customer churn dataset with mixed feature types
np.random.seed(42)
n_customers = 2000

# Create realistic customer data
data = {
    'age': np.random.normal(40, 15, n_customers).astype(int),
    'income': np.random.normal(50000, 20000, n_customers),
    'months_tenure': np.random.uniform(1, 60, n_customers),
    'contract_type': np.random.choice(['monthly', 'annual', 'two_year'], n_customers, p=[0.5, 0.3, 0.2]),
    'payment_method': np.random.choice(['credit_card', 'bank_transfer', 'electronic_check'], n_customers, p=[0.4, 0.3, 0.3]),
    'internet_service': np.random.choice(['dsl', 'fiber', 'no'], n_customers, p=[0.4, 0.4, 0.2])
}

# Create logical churn patterns.
# Despite its name, churn_probability is a linear score on the log-odds scale.
churn_probability = (
    -0.02 * data['age'] +                    # Younger customers churn more
    -0.00001 * data['income'] +              # Higher income customers churn less
    -0.05 * data['months_tenure'] +          # Longer tenure = less churn
    (np.array(data['contract_type']) == 'monthly') * 1.5 +  # Monthly contracts churn more
    (np.array(data['payment_method']) == 'electronic_check') * 0.8 +  # Electronic check risky
    2.0  # Intercept (baseline log-odds)
)

# Convert log-odds to a probability with the sigmoid, then sample a binary outcome
churn_prob = 1 / (1 + np.exp(-churn_probability))
data['churned'] = np.random.binomial(1, churn_prob)

# Create DataFrame
df = pd.DataFrame(data)

print("=== MIXED DATA TYPES DATASET ===")
print(f"Dataset shape: {df.shape}")
print(f"Churn rate: {df['churned'].mean():.1%}")
print(f"\nData types:")
print(df.dtypes)
print(f"\nCategorical variables:")
categorical_cols = ['contract_type', 'payment_method', 'internet_service']
for col in categorical_cols:
    print(f"• {col}: {df[col].nunique()} categories {list(df[col].unique())}")

=== MIXED DATA TYPES DATASET ===
Dataset shape: (2000, 7)
Churn rate: 53.5%

Data types:
age                   int32
income              float64
months_tenure       float64
contract_type        object
payment_method       object
internet_service     object
churned               int32
dtype: object

Categorical variables:
• contract_type: 3 categories ['annual', 'monthly', 'two_year']
• payment_method: 3 categories ['bank_transfer', 'credit_card', 'electronic_check']
• internet_service: 3 categories ['dsl', 'fiber', 'no']


### 5.2 One-Hot Encoding with Pandas

In [15]:
# Before encoding - show the problem
print("=== BEFORE ONE-HOT ENCODING ===")
numeric_cols = ['age', 'income', 'months_tenure']
X_numeric_only = df[numeric_cols]
y = df['churned']

# Try training with numeric features only
X_train_num, X_test_num, y_train, y_test = train_test_split(
    X_numeric_only, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_num_scaled = scaler.fit_transform(X_train_num)
X_test_num_scaled = scaler.transform(X_test_num)

model_numeric = LogisticRegression(random_state=42)
model_numeric.fit(X_train_num_scaled, y_train)

auc_numeric = roc_auc_score(y_test, model_numeric.predict_proba(X_test_num_scaled)[:, 1])
print(f"AUC with numeric features only: {auc_numeric:.3f}")

# One-hot encoding
print(f"\n=== AFTER ONE-HOT ENCODING ===")
df_encoded = pd.get_dummies(df, columns=categorical_cols, prefix=categorical_cols)

# Note: both counts below include the target column 'churned'
print(f"Original features: {len(df.columns)}")
print(f"After encoding: {len(df_encoded.columns)}")
print(f"New categorical features created:")

for col in categorical_cols:
    new_cols = [c for c in df_encoded.columns if c.startswith(col)]
    print(f"• {col} → {new_cols}")

# Train with all features
X_encoded = df_encoded.drop('churned', axis=1)
X_train_enc, X_test_enc, _, _ = train_test_split(
    X_encoded, y, test_size=0.2, random_state=42, stratify=y
)

scaler_enc = StandardScaler()
X_train_enc_scaled = scaler_enc.fit_transform(X_train_enc)
X_test_enc_scaled = scaler_enc.transform(X_test_enc)

model_encoded = LogisticRegression(random_state=42)
model_encoded.fit(X_train_enc_scaled, y_train)

auc_encoded = roc_auc_score(y_test, model_encoded.predict_proba(X_test_enc_scaled)[:, 1])
print(f"\nAUC with categorical features: {auc_encoded:.3f}")
print(f"Improvement: {auc_encoded - auc_numeric:+.3f}")

=== BEFORE ONE-HOT ENCODING ===
AUC with numeric features only: 0.669

=== AFTER ONE-HOT ENCODING ===
Original features: 7
After encoding: 13
New categorical features created:
• contract_type → ['contract_type_annual', 'contract_type_monthly', 'contract_type_two_year']
• payment_method → ['payment_method_bank_transfer', 'payment_method_credit_card', 'payment_method_electronic_check']
• internet_service → ['internet_service_dsl', 'internet_service_fiber', 'internet_service_no']

AUC with categorical features: 0.770
Improvement: +0.101
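
One caveat with the encoding above: the dummy columns for each variable always sum to 1, so one column per variable is redundant. The following is a minimal sketch, on a small hypothetical frame (`toy` is for illustration and is not the churn dataset), of how `drop_first=True` in `pd.get_dummies` drops one level per variable:

```python
import pandas as pd

# Hypothetical toy frame, for illustration only
toy = pd.DataFrame({
    'contract_type': ['monthly', 'annual', 'two_year', 'monthly'],
    'months_tenure': [3.0, 24.0, 48.0, 12.0],
})

# Default behaviour: one column per category level
full = pd.get_dummies(toy, columns=['contract_type'])

# drop_first=True removes the first level (alphabetically 'annual'),
# which then acts as the implicit baseline category
reduced = pd.get_dummies(toy, columns=['contract_type'], drop_first=True)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

With the L2-regularized `LogisticRegression` that scikit-learn uses by default, keeping every level still trains without trouble, as the results above show. Dropping a level mainly makes each remaining coefficient readable as a difference from the baseline category.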


### 5.3 Polynomial Features - Creating Interactions

In [16]:
from sklearn.preprocessing import PolynomialFeatures

# Start with simple numeric features for polynomial demo
simple_features = df[['age', 'income', 'months_tenure']]

print("=== POLYNOMIAL FEATURES FOR INTERACTIONS ===")
print("Original features:")
print(simple_features.columns.tolist())

# Create polynomial features (degree 2 includes interactions)
poly = PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)
X_poly = poly.fit_transform(simple_features)

# Get feature names
feature_names = poly.get_feature_names_out(simple_features.columns)
print(f"\nAfter polynomial features (degree=2):")
print(f"Feature count: {simple_features.shape[1]} → {X_poly.shape[1]}")
print("New features created:")
for i, name in enumerate(feature_names):
    if i >= 3:  # Skip original features
        print(f"• {name}")

# Train model with polynomial features
X_train_poly, X_test_poly, y_train_poly, y_test_poly = train_test_split(
    X_poly, y, test_size=0.2, random_state=42, stratify=y
)

scaler_poly = StandardScaler()
X_train_poly_scaled = scaler_poly.fit_transform(X_train_poly)
X_test_poly_scaled = scaler_poly.transform(X_test_poly)

# Use stronger L2 regularization (smaller C than the default 1.0) since we have more features now
model_poly = LogisticRegression(C=0.1, random_state=42, max_iter=2000)
model_poly.fit(X_train_poly_scaled, y_train_poly)

auc_poly = roc_auc_score(y_test_poly, model_poly.predict_proba(X_test_poly_scaled)[:, 1])

print(f"\nPerformance comparison:")
print(f"• Numeric only: {auc_numeric:.3f}")
print(f"• With categoricals: {auc_encoded:.3f}")
print(f"• With polynomials: {auc_poly:.3f}")

# Show most important polynomial features
coef_importance = pd.DataFrame({
    'feature': feature_names,
    'coefficient': model_poly.coef_[0],
    'abs_coefficient': np.abs(model_poly.coef_[0])
}).sort_values('abs_coefficient', ascending=False)

print(f"\nTop 5 most important polynomial features:")
print(coef_importance.head()[['feature', 'coefficient']].round(3))

=== POLYNOMIAL FEATURES FOR INTERACTIONS ===
Original features:
['age', 'income', 'months_tenure']

After polynomial features (degree=2):
Feature count: 3 → 9
New features created:
• age^2
• age income
• age months_tenure
• income^2
• income months_tenure
• months_tenure^2

Performance comparison:
• Numeric only: 0.669
• With categoricals: 0.770
• With polynomials: 0.669

Top 5 most important polynomial features:
                feature  coefficient
2         months_tenure       -0.464
5     age months_tenure       -0.204
7  income months_tenure       -0.184
6              income^2       -0.144
0                   age       -0.141


### 5.4 Feature Selection - Finding What Matters

In [17]:
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.feature_selection import RFE

print("=== FEATURE SELECTION METHODS ===")

# Method 1: Statistical selection (SelectKBest)
# Select top 10 features based on ANOVA F-test
selector_stats = SelectKBest(score_func=f_classif, k=10)
X_selected_stats = selector_stats.fit_transform(X_train_enc_scaled, y_train)

# Get selected feature names
selected_features_stats = X_encoded.columns[selector_stats.get_support()]
print("Statistical selection (top 10 features):")
for feature in selected_features_stats:
    print(f"• {feature}")

# Method 2: Recursive Feature Elimination (RFE)
# Use logistic regression to rank features
estimator = LogisticRegression(C=1.0, random_state=42, max_iter=2000)
selector_rfe = RFE(estimator, n_features_to_select=10, step=1)
X_selected_rfe = selector_rfe.fit_transform(X_train_enc_scaled, y_train)

selected_features_rfe = X_encoded.columns[selector_rfe.get_support()]
print(f"\nRFE selection (top 10 features):")
for feature in selected_features_rfe:
    print(f"• {feature}")

# Method 3: L1 regularization (automatic feature selection)
model_l1 = LogisticRegression(penalty='l1', C=0.1, solver='liblinear', random_state=42)
model_l1.fit(X_train_enc_scaled, y_train)

# Features with effectively non-zero coefficients (|coef| > 0.001)
l1_selected = X_encoded.columns[np.abs(model_l1.coef_[0]) > 0.001]
print(f"\nL1 regularization selection ({len(l1_selected)} features):")
for feature in l1_selected:
    print(f"• {feature}")

# Compare performance of different selection methods
selection_results = {}

# Original (all features)
auc_all = roc_auc_score(y_test, model_encoded.predict_proba(X_test_enc_scaled)[:, 1])
selection_results['All features'] = auc_all

# Statistical selection
model_stats = LogisticRegression(random_state=42)
model_stats.fit(X_selected_stats, y_train)
X_test_stats = selector_stats.transform(X_test_enc_scaled)
auc_stats = roc_auc_score(y_test, model_stats.predict_proba(X_test_stats)[:, 1])
selection_results['Statistical (10)'] = auc_stats

# RFE selection
model_rfe = LogisticRegression(random_state=42)
model_rfe.fit(X_selected_rfe, y_train)
X_test_rfe = selector_rfe.transform(X_test_enc_scaled)
auc_rfe = roc_auc_score(y_test, model_rfe.predict_proba(X_test_rfe)[:, 1])
selection_results['RFE (10)'] = auc_rfe

# L1 selection
auc_l1 = roc_auc_score(y_test, model_l1.predict_proba(X_test_enc_scaled)[:, 1])
selection_results['L1 regularization'] = auc_l1

print(f"\n=== FEATURE SELECTION PERFORMANCE ===")
print(f"{'Method':<20} {'Features':<10} {'AUC':<8}")
print("-" * 40)
print(f"{'All features':<20} {X_encoded.shape[1]:<10} {auc_all:.3f}")
print(f"{'Statistical':<20} {10:<10} {auc_stats:.3f}")
print(f"{'RFE':<20} {10:<10} {auc_rfe:.3f}")
print(f"{'L1 regularization':<20} {len(l1_selected):<10} {auc_l1:.3f}")

=== FEATURE SELECTION METHODS ===
Statistical selection (top 10 features):
• age
• income
• months_tenure
• contract_type_annual
• contract_type_monthly
• contract_type_two_year
• payment_method_bank_transfer
• payment_method_credit_card
• payment_method_electronic_check
• internet_service_no

RFE selection (top 10 features):
• age
• income
• months_tenure
• contract_type_annual
• contract_type_monthly
• contract_type_two_year
• payment_method_bank_transfer
• payment_method_credit_card
• payment_method_electronic_check
• internet_service_dsl

L1 regularization selection (6 features):
• age
• income
• months_tenure
• contract_type_monthly
• payment_method_electronic_check
• internet_service_dsl

=== FEATURE SELECTION PERFORMANCE ===
Method               Features   AUC     
----------------------------------------
All features         12         0.770
Statistical          10         0.772
RFE                  10         0.770
L1 regularization    6          0.771
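
The AUC differences in the table above are in the third decimal place and come from a single 400-row test split, so they are probably within noise. One way to get a steadier comparison is to put the selector inside a `Pipeline` and cross-validate it, so that selection is refit on each training fold. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption; it is not the churn dataset) and picks `k=6` arbitrarily:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the churn dataset)
X_syn, y_syn = make_classification(
    n_samples=500, n_features=12, n_informative=5, random_state=42
)

# Scaling, selection and the classifier are all refit inside each CV fold,
# so the held-out fold never influences which features are kept
select_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('select', SelectKBest(score_func=f_classif, k=6)),
    ('clf', LogisticRegression(max_iter=2000)),
])

cv_auc = cross_val_score(select_pipe, X_syn, y_syn, cv=5, scoring='roc_auc')
print(f"Mean CV AUC: {cv_auc.mean():.3f} (std {cv_auc.std():.3f})")
```

Comparing the fold-to-fold standard deviation with the gap between methods gives a rough sense of whether one selection method is actually better.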


### 5.5 Real-World Example - Complete Feature Engineering Pipeline

In [18]:
# Create a complete feature engineering function
def engineer_features(df, target_col):
    """Complete feature engineering pipeline"""
    
    # Separate target
    X = df.drop(target_col, axis=1)
    y = df[target_col]
    
    # Identify feature types
    numeric_features = X.select_dtypes(include=[np.number]).columns.tolist()
    categorical_features = X.select_dtypes(include=['object']).columns.tolist()
    
    print(f"Feature engineering pipeline:")
    print(f"• Numeric features: {len(numeric_features)}")
    print(f"• Categorical features: {len(categorical_features)}")
    
    # 1. One-hot encode categorical features
    if categorical_features:
        X_encoded = pd.get_dummies(X, columns=categorical_features, prefix=categorical_features)
        print(f"• After one-hot encoding: {X_encoded.shape[1]} features")
    else:
        X_encoded = X.copy()
    
    # 2. Create some domain-specific features
    if 'age' in X_encoded.columns and 'income' in X_encoded.columns:
        X_encoded['income_per_age'] = X_encoded['income'] / (X_encoded['age'] + 1)  # +1 guards against age == 0 only (assumes age >= 0)
        print(f"• Added income_per_age ratio")
    
    if 'months_tenure' in X_encoded.columns:
        X_encoded['is_new_customer'] = (X_encoded['months_tenure'] < 12).astype(int)
        X_encoded['is_loyal_customer'] = (X_encoded['months_tenure'] > 36).astype(int)
        print(f"• Added customer lifecycle features")
    
    print(f"• Final feature count: {X_encoded.shape[1]}")
    
    return X_encoded, y

# Apply complete feature engineering
X_final, y_final = engineer_features(df.copy(), 'churned')

# Train final model
X_train_final, X_test_final, y_train_final, y_test_final = train_test_split(
    X_final, y_final, test_size=0.2, random_state=42, stratify=y_final
)

scaler_final = StandardScaler()
X_train_final_scaled = scaler_final.fit_transform(X_train_final)
X_test_final_scaled = scaler_final.transform(X_test_final)

model_final = LogisticRegression(C=0.1, random_state=42, max_iter=2000)
model_final.fit(X_train_final_scaled, y_train_final)

auc_final = roc_auc_score(y_test_final, model_final.predict_proba(X_test_final_scaled)[:, 1])

print(f"\n=== FEATURE ENGINEERING IMPACT ===")
print(f"• Baseline (numeric only): {auc_numeric:.3f}")
print(f"• + Categorical encoding: {auc_encoded:.3f}")
print(f"• + Complete engineering: {auc_final:.3f}")
print(f"• Total improvement: {auc_final - auc_numeric:+.3f}")

# Show most important engineered features
feature_importance_final = pd.DataFrame({
    'feature': X_final.columns,
    'coefficient': model_final.coef_[0],
    'abs_coefficient': np.abs(model_final.coef_[0])
}).sort_values('abs_coefficient', ascending=False)

print(f"\nTop 8 most important features:")
for i, (_, row) in enumerate(feature_importance_final.head(8).iterrows()):
    direction = "↑" if row['coefficient'] > 0 else "↓"
    print(f"{i+1}. {row['feature']}: {row['coefficient']:.3f} {direction}")

Feature engineering pipeline:
• Numeric features: 3
• Categorical features: 3
• After one-hot encoding: 12 features
• Added income_per_age ratio
• Added customer lifecycle features
• Final feature count: 15

=== FEATURE ENGINEERING IMPACT ===
• Baseline (numeric only): 0.669
• + Categorical encoding: 0.770
• + Complete engineering: 0.769
• Total improvement: +0.100

Top 8 most important features:
1. months_tenure: -0.727 ↓
2. contract_type_monthly: 0.399 ↑
3. age: -0.340 ↓
4. income: -0.261 ↓
5. contract_type_annual: -0.247 ↓
6. payment_method_electronic_check: 0.231 ↑
7. contract_type_two_year: -0.224 ↓
8. is_new_customer: 0.170 ↑
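
The `engineer_features` function above runs outside of any scikit-learn object, so it has to be called by hand on new data. A possible alternative, sketched here under assumptions, is to wrap the same logic in a `FunctionTransformer` so it can become a pipeline step. The helper name `add_lifecycle_features` and the `sample` frame are hypothetical:

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def add_lifecycle_features(X):
    """Hypothetical helper mirroring the tenure flags created above."""
    X = X.copy()  # avoid mutating the caller's DataFrame
    X['is_new_customer'] = (X['months_tenure'] < 12).astype(int)
    X['is_loyal_customer'] = (X['months_tenure'] > 36).astype(int)
    return X

# FunctionTransformer is stateless here: fit() does nothing and
# transform() simply calls the wrapped function
lifecycle = FunctionTransformer(add_lifecycle_features)

sample = pd.DataFrame({'months_tenure': [2.0, 20.0, 50.0]})
out = lifecycle.fit_transform(sample)
print(out)
```

Because the thresholds (12 and 36 months) are fixed constants rather than values learned from data, applying this step before or after the train/test split gives the same result.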


## 6. Production Pipeline

### 6.1 sklearn Pipeline - Preprocessing + Model in One Step

In [19]:
# Create production-ready pipeline using previous churn dataset
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

# Recreate our customer dataset (same as Section 5)
np.random.seed(42)
n_customers = 2000

data = {
    'age': np.random.normal(40, 15, n_customers).astype(int),
    'income': np.random.normal(50000, 20000, n_customers),
    'months_tenure': np.random.uniform(1, 60, n_customers),
    'contract_type': np.random.choice(['monthly', 'annual', 'two_year'], n_customers, p=[0.5, 0.3, 0.2]),
    'payment_method': np.random.choice(['credit_card', 'bank_transfer', 'electronic_check'], n_customers, p=[0.4, 0.3, 0.3]),
    'internet_service': np.random.choice(['dsl', 'fiber', 'no'], n_customers, p=[0.4, 0.4, 0.2])
}

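# Create churn target (despite its name, churn_probability is a log-odds score; the sigmoid below maps it to a probability)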
churn_probability = (
    -0.02 * data['age'] +
    -0.00001 * data['income'] +
    -0.05 * data['months_tenure'] +
    (np.array(data['contract_type']) == 'monthly') * 1.5 +
    (np.array(data['payment_method']) == 'electronic_check') * 0.8 +
    2.0
)
churn_prob = 1 / (1 + np.exp(-churn_probability))
data['churned'] = np.random.binomial(1, churn_prob)

df = pd.DataFrame(data)

print("=== PRODUCTION PIPELINE SETUP ===")
print(f"Dataset shape: {df.shape}")

# Define feature types for pipeline
numeric_features = ['age', 'income', 'months_tenure']
categorical_features = ['contract_type', 'payment_method', 'internet_service']

print(f"Numeric features: {numeric_features}")
print(f"Categorical features: {categorical_features}")

# Create preprocessing steps
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),  # Handle missing values
    ('scaler', StandardScaler())                     # Scale features
])

categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),  # Handle missing values
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))  # One-hot encode (sparse_output needs scikit-learn >= 1.2)
])

# Combine preprocessing steps
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

# Create complete pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(C=0.1, random_state=42, max_iter=2000))
])

print(f"\nPipeline steps:")
for i, (name, step) in enumerate(pipeline.steps):
    print(f"{i+1}. {name}: {type(step).__name__}")

=== PRODUCTION PIPELINE SETUP ===
Dataset shape: (2000, 7)
Numeric features: ['age', 'income', 'months_tenure']
Categorical features: ['contract_type', 'payment_method', 'internet_service']

Pipeline steps:
1. preprocessor: ColumnTransformer
2. classifier: LogisticRegression
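
The `handle_unknown='ignore'` setting in the categorical transformer above decides what happens when a category appears at prediction time that was never seen during fitting. A small isolated sketch (the frames `train_cat` and `new_cat` are hypothetical) shows the effect:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical frames: the encoder only ever sees three contract types
train_cat = pd.DataFrame({'contract_type': ['monthly', 'annual', 'two_year']})
new_cat = pd.DataFrame({'contract_type': ['monthly', 'unknown_contract']})

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train_cat)

# Learned categories are sorted: annual, monthly, two_year
encoded = enc.transform(new_cat).toarray()
print(enc.categories_[0])
print(encoded)
```

The known value gets a single 1 in its column, while the unseen value becomes a row of zeros instead of raising an error. The same applies to the `'missing'` fill value from `SimpleImputer` if no missing values were present when the pipeline was fit.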


### 6.2 Training and Evaluating the Pipeline

In [20]:
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report

# Prepare data
X = df.drop('churned', axis=1)
y = df['churned']

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("=== TRAINING PRODUCTION PIPELINE ===")

# Train pipeline (handles all preprocessing automatically)
pipeline.fit(X_train, y_train)

# Make predictions
y_pred = pipeline.predict(X_test)
y_prob = pipeline.predict_proba(X_test)[:, 1]

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_prob)

print(f"Pipeline Performance:")
print(f"• Accuracy: {accuracy:.3f}")
print(f"• ROC-AUC: {auc:.3f}")

print(f"\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Retained', 'Churned']))

# Show what pipeline learned
feature_names = (numeric_features + 
                list(pipeline.named_steps['preprocessor']
                    .named_transformers_['cat']
                    .named_steps['onehot']
                    .get_feature_names_out(categorical_features)))

coefficients = pipeline.named_steps['classifier'].coef_[0]

feature_importance = pd.DataFrame({
    'feature': feature_names,
    'coefficient': coefficients
}).sort_values('coefficient', key=abs, ascending=False)

print(f"\nTop 5 Most Important Features:")
for _, row in feature_importance.head().iterrows():
    direction = "increases" if row['coefficient'] > 0 else "decreases"
    print(f"• {row['feature']}: {row['coefficient']:.3f} ({direction} churn probability)")

=== TRAINING PRODUCTION PIPELINE ===
Pipeline Performance:
• Accuracy: 0.718
• ROC-AUC: 0.770

Classification Report:
              precision    recall  f1-score   support

    Retained       0.69      0.70      0.70       186
     Churned       0.74      0.73      0.73       214

    accuracy                           0.72       400
   macro avg       0.72      0.72      0.72       400
weighted avg       0.72      0.72      0.72       400


Top 5 Most Important Features:
• months_tenure: -0.857 (decreases churn probability)
• contract_type_monthly: 0.823 (increases churn probability)
• payment_method_electronic_check: 0.437 (increases churn probability)
• contract_type_two_year: -0.430 (decreases churn probability)
• contract_type_annual: -0.413 (decreases churn probability)


### 6.3 Handling New Data - Edge Cases and Robustness

In [21]:
# Create challenging test cases that could break a naive model
edge_cases = pd.DataFrame({
    'age': [25, np.nan, 70, 18, 45],  # Missing value, extreme values
    'income': [30000, 80000, np.nan, 200000, 45000],  # Missing income, very high income
    'months_tenure': [1, 24, np.nan, 0.5, 48],  # Missing tenure, very short tenure
    'contract_type': ['monthly', 'annual', 'unknown_contract', 'two_year', 'monthly'],  # Unknown category
    'payment_method': ['credit_card', 'new_payment_method', 'bank_transfer', 'electronic_check', np.nan],  # Unknown method, missing
    'internet_service': ['fiber', 'dsl', 'satellite', 'no', 'fiber']  # Unknown service type
})

print("=== TESTING PIPELINE ROBUSTNESS ===")
print("Edge cases to test:")
print(edge_cases)

# Pipeline handles all edge cases automatically!
try:
    predictions = pipeline.predict(edge_cases)
    probabilities = pipeline.predict_proba(edge_cases)
    
    print(f"\n✅ Pipeline successfully handled all edge cases!")
    
    print(f"\nPredictions:")
    print(f"{'Case':<6} {'Prediction':<12} {'Churn Prob':<11} {'Confidence'}")
    print("-" * 45)
    
    for i in range(len(edge_cases)):
        pred_label = 'Will Churn' if predictions[i] == 1 else 'Will Stay'
        churn_prob = probabilities[i, 1]
        confidence = max(probabilities[i])
        
        print(f"{i+1:<6} {pred_label:<12} {churn_prob:<11.3f} {confidence:.3f}")
        
except Exception as e:
    print(f"❌ Pipeline failed: {e}")

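# Show how pipeline handles missing/unknown values
# Note: 'missing' never appeared during training, so the imputed category is itself unknown and is encoded as all zeros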
print(f"\n🔍 HOW PIPELINE HANDLES EDGE CASES:")
print(f"• Missing numeric values → Filled with median")
print(f"• Missing categorical values → Filled with 'missing' category")
print(f"• Unknown categories → Ignored (all zeros in one-hot encoding)")
print(f"• Extreme values → Scaled with same scaler from training")

=== TESTING PIPELINE ROBUSTNESS ===
Edge cases to test:
    age    income  months_tenure     contract_type      payment_method  \
0  25.0   30000.0            1.0           monthly         credit_card   
1   NaN   80000.0           24.0            annual  new_payment_method   
2  70.0       NaN            NaN  unknown_contract       bank_transfer   
3  18.0  200000.0            0.5          two_year    electronic_check   
4  45.0   45000.0           48.0           monthly                 NaN   

  internet_service  
0            fiber  
1              dsl  
2        satellite  
3               no  
4            fiber  

✅ Pipeline successfully handled all edge cases!

Predictions:
Case   Prediction   Churn Prob  Confidence
---------------------------------------------
1      Will Churn   0.929       0.929
2      Will Stay    0.385       0.615
3      Will Stay    0.284       0.716
4      Will Churn   0.526       0.526
5      Will Stay    0.453       0.547

🔍 HOW PIPELINE HANDLES EDGE CA